{"instructions": ["How did they obtain the dataset?", "What activation function do they use in their model?", "What baselines do they compare to?", "How are chunks defined?", "What features are extracted?", "What features are extracted?", "Was the approach used in this work to detect fake news fully supervised?", "Based on this paper, what is the more predictive set of features to detect fake news?", "How big is the dataset used in this work?", "How is a \"chunk of posts\" defined in this work?", "What baselines were used in this work?"], "outputs": ["public resources where suspicious Twitter accounts were annotated, list with another 32 Twitter accounts from BIBREF19 that are considered trustworthy", "relu, selu, tanh", "Top-$k$ replies, likes, or re-tweets, FacTweet (tweet-level), LR + All Features (chunk-level), LR + All Features (tweet-level), Tweet2vec, LR + Bag-of-words", "Chunks is group of tweets from single account that  is consecutive in time - idea is that this group can show secret intention of malicious accounts.", "Sentiment, Morality, Style, Words embeddings", "15 emotion types, sentiment classes, positive and negative, care, harm, fairness, cheating, loyalty, betrayal, authority, subversion, sanctity, and degradation, count of question marks, exclamation marks, consecutive characters and letters, links, hashtags, users' mentions, uppercase ratio, tweet length, words embeddings", "Yes", "words embeddings, style, and morality features", "Total dataset size: 171 account (522967 tweets)", "chunk consists of a sorted sequence of tweets labeled by the label of its corresponding account", "LR + Bag-of-words, Tweet2vec, LR + All Features (tweet-level), LR + All Features (chunk-level), FacTweet (tweet-level), Top-$k$ replies, likes, or re-tweets"], "input": "Introduction\nSocial media platforms have made the spreading of fake news easier, faster as well as able to reach a wider audience. Social media offer another feature which is the anonymity for the authors, and this opens the door to many suspicious individuals or organizations to utilize these platforms. Recently, there has been an increased number of spreading fake news and rumors over the web and social media BIBREF0. Fake news in social media vary considering the intention to mislead. Some of these news are spread with the intention to be ironic or to deliver the news in an ironic way (satirical news). Others, such as propaganda, hoaxes, and clickbaits, are spread to mislead the audience or to manipulate their opinions. In the case of Twitter, suspicious news annotations should be done on a tweet rather than an account level, since some accounts mix fake with real news. However, these annotations are extremely costly and time consuming \u2013 i.e., due to high volume of available tweets Consequently, a first step in this direction, e.g., as a pre-filtering step, can be viewed as the task of detecting fake news at the account level.\nThe main obstacle for detecting suspicious Twitter accounts is due to the behavior of mixing some real news with the misleading ones. Consequently, we investigate ways to detect suspicious accounts by considering their tweets in groups (chunks). Our hypothesis is that suspicious accounts have a unique pattern in posting tweet sequences. Since their intention is to mislead, the way they transition from one set of tweets to the next has a hidden signature, biased by their intentions. Therefore, reading these tweets in chunks has the potential to improve the detection of the fake news accounts.\nIn this work, we investigate the problem of discriminating between factual and non-factual accounts in Twitter. To this end, we collect a large dataset of tweets using a list of propaganda, hoax and clickbait accounts and compare different versions of sequential chunk-based approaches using a variety of feature sets against several baselines. Several approaches have been proposed for news verification, whether in social media (rumors detection) BIBREF0, BIBREF1, BIBREF2, BIBREF3, BIBREF4, or in news claims BIBREF5, BIBREF6, BIBREF7, BIBREF8. The main orientation in the previous works is to verify the textual claims/tweets but not their sources. To the best of our knowledge, this is the first work aiming to detect factuality at the account level, and especially from a textual perspective. Our contributions are:\n[leftmargin=4mm]\nWe propose an approach to detect non-factual Twitter accounts by treating post streams as a sequence of tweets' chunks. We test several semantic and dictionary-based features together with a neural sequential approach, and apply an ablation test to investigate their contribution.\nWe benchmark our approach against other approaches that discard the chronological order of the tweets or read the tweets individually. The results show that our approach produces superior results at detecting non-factual accounts.\nMethodology\nGiven a news Twitter account, we read its tweets from the account's timeline. Then we sort the tweets by the posting date in ascending way and we split them into $N$ chunks. Each chunk consists of a sorted sequence of tweets labeled by the label of its corresponding account. We extract a set of features from each chunk and we feed them into a recurrent neural network to model the sequential flow of the chunks' tweets. We use an attention layer with dropout to attend over the most important tweets in each chunk. Finally, the representation is fed into a softmax layer to produce a probability distribution over the account types and thus predict the factuality of the accounts. Since we have many chunks for each account, the label for an account is obtained by taking the majority class of the account's chunks.\nInput Representation. Let $t$ be a Twitter account that contains $m$ tweets. These tweets are sorted by date and split into a sequence of chunks $ck = \\langle ck_1, \\ldots , ck_n \\rangle $, where each $ck_i$ contains $s$ tweets. Each tweet in $ck_i$ is represented by a vector $v \\in {\\rm I\\!R}^d$ , where $v$ is the concatenation of a set of features' vectors, that is $v = \\langle f_1, \\ldots , f_n \\rangle $. Each feature vector $f_i$ is built by counting the presence of tweet's words in a set of lexical lists. The final representation of the tweet is built by averaging the single word vectors.\nFeatures. We argue that different kinds of features like the sentiment of the text, morality, and other text-based features are critical to detect the nonfactual Twitter accounts by utilizing their occurrence during reporting the news in an account's timeline. We employ a rich set of features borrowed from previous works in fake news, bias, and rumors detection BIBREF0, BIBREF1, BIBREF8, BIBREF9.\n[leftmargin=4mm]\nEmotion: We build an emotions vector using word occurrences of 15 emotion types from two available emotional lexicons. We use the NRC lexicon BIBREF10, which contains $\\sim $14K words labeled using the eight Plutchik's emotions BIBREF11. The other lexicon is SentiSense BIBREF12 which is a concept-based affective lexicon that attaches emotional meanings to concepts from the WordNet lexical database. It has $\\sim $5.5 words labeled with emotions from a set of 14 emotional categories We use the categories that do not exist in the NRC lexicon.\nSentiment: We extract the sentiment of the tweets by employing EffectWordNet BIBREF13, SenticNet BIBREF14, NRC BIBREF10, and subj_lexicon BIBREF15, where each has the two sentiment classes, positive and negative.\nMorality: Features based on morality foundation theory BIBREF16 where words are labeled in one of the following 10 categories (care, harm, fairness, cheating, loyalty, betrayal, authority, subversion, sanctity, and degradation).\nStyle: We use canonical stylistic features, such as the count of question marks, exclamation marks, consecutive characters and letters, links, hashtags, users' mentions. In addition, we extract the uppercase ratio and the tweet length.\nWords embeddings: We extract words embeddings of the words of the tweet using $Glove\\-840B-300d$ BIBREF17 pretrained model. The tweet final representation is obtained by averaging its words embeddings.\nModel. To account for chunk sequences we make use of a de facto standard approach and opt for a recurrent neural model using long short-term memory (LSTM) BIBREF18. In our model, the sequence consists of a sequence of tweets belonging to one chunk (Figure FIGREF11). The LSTM learns the hidden state $h\\textsubscript {t}$ by capturing the previous timesteps (past tweets). The produced hidden state $h\\textsubscript {t}$ at each time step is passed to the attention layer which computes a `context' vector $c\\textsubscript {t}$ as the weighted mean of the state sequence $h$ by:\nWhere $T$ is the total number of timesteps in the input sequence and $\\alpha \\textsubscript {tj}$ is a weight computed at each time step $j$ for each state hj.\nExperiments and Results\nData. We build a dataset of Twitter accounts based on two lists annotated in previous works. For the non-factual accounts, we rely on a list of 180 Twitter accounts from BIBREF1. This list was created based on public resources where suspicious Twitter accounts were annotated with the main fake news types (clickbait, propaganda, satire, and hoax). We discard the satire labeled accounts since their intention is not to mislead or deceive. On the other hand, for the factual accounts, we use a list with another 32 Twitter accounts from BIBREF19 that are considered trustworthy by independent third parties. We discard some accounts that publish news in languages other than English (e.g., Russian or Arabic). Moreover, to ensure the quality of the data, we remove the duplicate, media-based, and link-only tweets. For each account, we collect the maximum amount of tweets allowed by Twitter API. Table TABREF13 presents statistics on our dataset.\nBaselines. We compare our approach (FacTweet) to the following set of baselines:\n[leftmargin=4mm]\nLR + Bag-of-words: We aggregate the tweets of a feed and we use a bag-of-words representation with a logistic regression (LR) classifier.\nTweet2vec: We use the Bidirectional Gated recurrent neural network model proposed in BIBREF20. We keep the default parameters that were provided with the implementation. To represent the tweets, we use the decoded embedding produced by the model. With this baseline we aim at assessing if the tweets' hashtags may help detecting the non-factual accounts.\nLR + All Features (tweet-level): We extract all our features from each tweet and feed them into a LR classifier. Here, we do not aggregate over tweets and thus view each tweet independently.\nLR + All Features (chunk-level): We concatenate the features' vectors of the tweets in a chunk and feed them into a LR classifier.\nFacTweet (tweet-level): Similar to the FacTweet approach, but at tweet-level; the sequential flow of the tweets is not utilized. We aim at investigating the importance of the sequential flow of tweets.\nTop-$k$ replies, likes, or re-tweets: Some approaches in rumors detection use the number of replies, likes, and re-tweets to detect rumors BIBREF21. Thus, we extract top $k$ replied, liked or re-tweeted tweets from each account to assess the accounts factuality. We tested different $k$ values between 10 tweets to the max number of tweets from each account. Figure FIGREF24 shows the macro-F1 values for different $k$ values. It seems that $k=500$ for the top replied tweets achieves the highest result. Therefore, we consider this as a baseline.\nExperimental Setup. We apply a 5 cross-validation on the account's level. For the FacTweet model, we experiment with 25% of the accounts for validation and parameters selection. We use hyperopt library to select the hyper-parameters on the following values: LSTM layer size (16, 32, 64), dropout ($0.0-0.9$), activation function ($relu$, $selu$, $tanh$), optimizer ($sgd$, $adam$, $rmsprop$) with varying the value of the learning rate (1e-1,..,-5), and batch size (4, 8, 16). The validation split is extracted on the class level using stratified sampling: we took a random 25% of the accounts from each class since the dataset is unbalanced. Discarding the classes' size in the splitting process may affect the minority classes (e.g. hoax). For the baselines' classifier, we tested many classifiers and the LR showed the best overall performance.\nResults. Table TABREF25 presents the results. We present the results using a chunk size of 20, which was found to be the best size on the held-out data. Figure FIGREF24 shows the results of different chunks sizes. FacTweet performs better than the proposed baselines and obtains the highest macro-F1 value of $0.565$. Our results indicate the importance of taking into account the sequence of the tweets in the accounts' timelines. The sequence of these tweets is better captured by our proposed model sequence-agnostic or non-neural classifiers. Moreover, the results demonstrate that the features at tweet-level do not perform well to detect the Twitter accounts factuality, since they obtain a result near to the majority class ($0.18$). Another finding from our experiments shows that the performance of the Tweet2vec is weak. This demonstrates that tweets' hashtags are not informative to detect non-factual accounts. In Table TABREF25, we present ablation tests so as to quantify the contribution of subset of features. The results indicate that most performance gains come from words embeddings, style, and morality features. Other features (emotion and sentiment) show lower importance: nevertheless, they still improve the overall system performance (on average 0.35% Macro-F$_1$ improvement). These performance figures suggest that non-factual accounts use semantic and stylistic hidden signatures mostly while tweeting news, so as to be able to mislead the readers and behave as reputable (i.e., factual) sources. We leave a more fine-grained, diachronic analysis of semantic and stylistic features \u2013 how semantic and stylistic signature evolve across time and change across the accounts' timelines \u2013 for future work.\nConclusions\nIn this paper, we proposed a model that utilizes chunked timelines of tweets and a recurrent neural model in order to infer the factuality of a Twitter news account. Our experimental results indicate the importance of analyzing tweet stream into chunks, as well as the benefits of heterogeneous knowledge source (i.e., lexica as well as text) in order to capture factuality. In future work, we would like to extend this line of research with further in-depth analysis to understand the flow change of the used features in the accounts' streams. Moreover, we would like to take our approach one step further incorporating explicit temporal information, e.g., using timestamps. Crucially, we are also interested in developing a multilingual version of our approach, for instance by leveraging the now ubiquitous cross-lingual embeddings BIBREF22, BIBREF23.", "source": "scientific_qa", "evaluation": "f1"}
{"instructions": ["What models are used in the experiment?", "What are the differences between this dataset and pre-existing ones?", "In what language are the tweets?", "What is the size of the new dataset?", "What is the size of the new dataset?", "What kinds of offensive content are explored?", "What is the best performing model?", "How many annotators participated?", "What is the definition of offensive language?", "What are the three layers of the annotation scheme?", "How long is the dataset for each step of hierarchy?"], "outputs": ["linear SVM, bidirectional Long Short-Term-Memory (BiLSTM), Convolutional Neural Network (CNN)", "no prior work has explored the target of the offensive language", "English", "14,100 tweets", "Dataset contains total of 14100 annotations.", "non-targeted profanity and swearing, targeted insults such as cyberbullying, offensive content related to ethnicity, gender or sexual orientation, political affiliation, religious belief, and anything belonging to hate speech", "CNN ", "five annotators", " Most prior work focuses on a different aspect of offensive language such as abusive language BIBREF0 , BIBREF1 , (cyber-)aggression BIBREF2 , (cyber-)bullying BIBREF3 , BIBREF4 , toxic comments INLINEFORM0 , hate speech BIBREF5 , BIBREF6 , BIBREF7 , BIBREF8 , BIBREF9 , BIBREF10 , and offensive language BIBREF11 . Prior work has focused on these aspects of offensive language in Twitter BIBREF3 , BIBREF7 , BIBREF8 , BIBREF11 , Wikipedia comments, and Facebook posts BIBREF2 .", "Level A: Offensive language Detection\n, Level B: Categorization of Offensive Language\n, Level C: Offensive Language Target Identification\n", "Level A: 14100 Tweets\nLevel B: 4640 Tweets\nLevel C: 4089 Tweets"], "input": "Introduction\nOffensive content has become pervasive in social media and a reason of concern for government organizations, online communities, and social media platforms. One of the most common strategies to tackle the problem is to train systems capable of recognizing offensive content, which then can be deleted or set aside for human moderation. In the last few years, there have been several studies published on the application of computational methods to deal with this problem. Most prior work focuses on a different aspect of offensive language such as abusive language BIBREF0 , BIBREF1 , (cyber-)aggression BIBREF2 , (cyber-)bullying BIBREF3 , BIBREF4 , toxic comments INLINEFORM0 , hate speech BIBREF5 , BIBREF6 , BIBREF7 , BIBREF8 , BIBREF9 , BIBREF10 , and offensive language BIBREF11 . Prior work has focused on these aspects of offensive language in Twitter BIBREF3 , BIBREF7 , BIBREF8 , BIBREF11 , Wikipedia comments, and Facebook posts BIBREF2 .\nRecently, Waseem et. al. ( BIBREF12 ) acknowledged the similarities among prior work and discussed the need for a typology that differentiates between whether the (abusive) language is directed towards a specific individual or entity or towards a generalized group and whether the abusive content is explicit or implicit. Wiegand et al. ( BIBREF11 ) followed this trend as well on German tweets. In their evaluation, they have a task to detect offensive vs not offensive tweets and a second task for distinguishing between the offensive tweets as profanity, insult, or abuse. However, no prior work has explored the target of the offensive language, which is important in many scenarios, e.g., when studying hate speech with respect to a specific target.\nTherefore, we expand on these ideas by proposing a a hierarchical three-level annotation model that encompasses:\nUsing this annotation model, we create a new large publicly available dataset of English tweets. The key contributions of this paper are as follows:\nRelated Work\nDifferent abusive and offense language identification sub-tasks have been explored in the past few years including aggression identification, bullying detection, hate speech, toxic comments, and offensive language.\nAggression identification: The TRAC shared task on Aggression Identification BIBREF2 provided participants with a dataset containing 15,000 annotated Facebook posts and comments in English and Hindi for training and validation. For testing, two different sets, one from Facebook and one from Twitter were provided. Systems were trained to discriminate between three classes: non-aggressive, covertly aggressive, and overtly aggressive.\nBullying detection: Several studies have been published on bullying detection. One of them is the one by xu2012learning which apply sentiment analysis to detect bullying in tweets. xu2012learning use topic models to to identify relevant topics in bullying. Another related study is the one by dadvar2013improving which use user-related features such as the frequency of profanity in previous messages to improve bullying detection.\nHate speech identification: It is perhaps the most widespread abusive language detection sub-task. There have been several studies published on this sub-task such as kwok2013locate and djuric2015hate who build a binary classifier to distinguish between `clean' comments and comments containing hate speech and profanity. More recently, Davidson et al. davidson2017automated presented the hate speech detection dataset containing over 24,000 English tweets labeled as non offensive, hate speech, and profanity.\nOffensive language: The GermEval BIBREF11 shared task focused on Offensive language identification in German tweets. A dataset of over 8,500 annotated tweets was provided for a course-grained binary classification task in which systems were trained to discriminate between offensive and non-offensive tweets and a second task where the organizers broke down the offensive class into three classes: profanity, insult, and abuse.\nToxic comments: The Toxic Comment Classification Challenge was an open competition at Kaggle which provided participants with comments from Wikipedia labeled in six classes: toxic, severe toxic, obscene, threat, insult, identity hate.\nWhile each of these sub-tasks tackle a particular type of abuse or offense, they share similar properties and the hierarchical annotation model proposed in this paper aims to capture this. Considering that, for example, an insult targeted at an individual is commonly known as cyberbulling and that insults targeted at a group are known as hate speech, we pose that OLID's hierarchical annotation model makes it a useful resource for various offensive language identification sub-tasks.\nHierarchically Modelling Offensive Content\nIn the OLID dataset, we use a hierarchical annotation model split into three levels to distinguish between whether language is offensive or not (A), and type (B) and target (C) of the offensive language. Each level is described in more detail in the following subsections and examples are shown in Table TABREF10 .\nLevel A: Offensive language Detection\nLevel A discriminates between offensive (OFF) and non-offensive (NOT) tweets.\nNot Offensive (NOT): Posts that do not contain offense or profanity;\nOffensive (OFF): We label a post as offensive if it contains any form of non-acceptable language (profanity) or a targeted offense, which can be veiled or direct. This category includes insults, threats, and posts containing profane language or swear words.\nLevel B: Categorization of Offensive Language\nLevel B categorizes the type of offense and two labels are used: targeted (TIN) and untargeted (INT) insults and threats.\nTargeted Insult (TIN): Posts which contain an insult/threat to an individual, group, or others (see next layer);\nUntargeted (UNT): Posts containing non-targeted profanity and swearing. Posts with general profanity are not targeted, but they contain non-acceptable language.\nLevel C: Offensive Language Target Identification\nLevel C categorizes the targets of insults and threats as individual (IND), group (GRP), and other (OTH).\nIndividual (IND): Posts targeting an individual. It can be a a famous person, a named individual or an unnamed participant in the conversation. Insults and threats targeted at individuals are often defined as cyberbulling.\nGroup (GRP): The target of these offensive posts is a group of people considered as a unity due to the same ethnicity, gender or sexual orientation, political affiliation, religious belief, or other common characteristic. Many of the insults and threats targeted at a group correspond to what is commonly understood as hate speech.\nOther (OTH): The target of these offensive posts does not belong to any of the previous two categories (e.g. an organization, a situation, an event, or an issue).\nData Collection\nThe data included in OLID has been collected from Twitter. We retrieved the data using the Twitter API by searching for keywords and constructions that are often included in offensive messages, such as `she is' or `to:BreitBartNews'. We carried out a first round of trial annotation of 300 instances with six experts. The goal of the trial annotation was to 1) evaluate the proposed tagset; 2) evaluate the data retrieval method; and 3) create a gold standard with instances that could be used as test questions in the training and test setting annotation which was carried out using crowdsourcing. The breakdown of keywords and their offensive content in the trial data of 300 tweets is shown in Table TABREF14 . We included a left (@NewYorker) and far-right (@BreitBartNews) news accounts because there tends to be political offense in the comments. One of the best offensive keywords was tweets that were flagged as not being safe by the Twitter `safe' filter (the `-' indicates `not safe'). The vast majority of content on Twitter is not offensive so we tried different strategies to keep a reasonable number of tweets in the offensive class amounting to around 30% of the dataset including excluding some keywords that were not high in offensive content such as `they are` and `to:NewYorker`. Although `he is' is lower in offensive content we kept it as a keyword to avoid gender bias. In addition to the keywords in the trial set, we searched for more political keywords which tend to be higher in offensive content, and sampled our dataset such that 50% of the the tweets come from political keywords and 50% come from non-political keywords. In addition to the keywords `gun control', and `to:BreitbartNews', political keywords used to collect these tweets are `MAGA', `antifa', `conservative' and `liberal'. We computed Fliess' INLINEFORM0 on the trial set for the five annotators on 21 of the tweets. INLINEFORM1 is .83 for Layer A (OFF vs NOT) indicating high agreement. As to normalization and anonymization, no user metadata or Twitter IDs have been stored, and URLs and Twitter mentions have been substituted to placeholders. We follow prior work in related areas (burnap2015cyber,davidson2017automated) and annotate our data using crowdsourcing using the platform Figure Eight. We ensure data quality by: 1) we only received annotations from individuals who were experienced in the platform; and 2) we used test questions to discard annotations of individuals who did not reach a certain threshold. Each instance in the dataset was annotated by multiple annotators and inter-annotator agreement has been calculated. We first acquired two annotations for each instance. In case of 100% agreement, we considered these as acceptable annotations, and in case of disagreement, we requested more annotations until the agreement was above 66%. After the crowdsourcing annotation, we used expert adjudication to guarantee the quality of the annotation. The breakdown of the data into training and testing for the labels from each level is shown in Table TABREF15 .\nExperiments and Evaluation\nWe assess our dataset using traditional and deep learning methods. Our simplest model is a linear SVM trained on word unigrams. SVMs have produced state-of-the-art results for many text classification tasks BIBREF13 . We also train a bidirectional Long Short-Term-Memory (BiLSTM) model, which we adapted from the sentiment analysis system of sentimentSystem,rasooli2018cross and altered to predict offensive labels instead. It consists of (1) an input embedding layer, (2) a bidirectional LSTM layer, (3) an average pooling layer of input features. The concatenation of the LSTM's and average pool layer is passed through a dense layer and the output is passed through a softmax function. We set two input channels for the input embedding layers: pre-trained FastText embeddings BIBREF14 , as well as updatable embeddings learned by the model during training. Finally, we also apply a Convolutional Neural Network (CNN) model based on the architecture of BIBREF15 , using the same multi-channel inputs as the above BiLSTM.\nOur models are trained on the training data, and evaluated by predicting the labels for the held-out test set. The distribution is described in Table TABREF15 . We evaluate and compare the models using the macro-averaged F1-score as the label distribution is highly imbalanced. Per-class Precision (P), Recall (R), and F1-score (F1), also with other averaged metrics are also reported. The models are compared against baselines of predicting all labels as the majority or minority classes.\nOffensive Language Detection\nThe performance on discriminating between offensive (OFF) and non-offensive (NOT) posts is reported in Table TABREF18 . We can see that all systems perform significantly better than chance, with the neural models being substantially better than the SVM. The CNN outperforms the RNN model, achieving a macro-F1 score of 0.80.\nCategorization of Offensive Language\nIn this experiment, the two systems were trained to discriminate between insults and threats (TIN) and untargeted (UNT) offenses, which generally refer to profanity. The results are shown in Table TABREF19 .\nThe CNN system achieved higher performance in this experiment compared to the BiLSTM, with a macro-F1 score of 0.69. All systems performed better at identifying target and threats (TIN) than untargeted offenses (UNT).\nOffensive Language Target Identification\nThe results of the offensive target identification experiment are reported in Table TABREF20 . Here the systems were trained to distinguish between three targets: a group (GRP), an individual (IND), or others (OTH). All three models achieved similar results far surpassing the random baselines, with a slight performance edge for the neural models.\nThe performance of all systems for the OTH class is 0. This poor performances can be explained by two main factors. First, unlike the two other classes, OTH is a heterogeneous collection of targets. It includes offensive tweets targeted at organizations, situations, events, etc. making it more challenging for systems to learn discriminative properties of this class. Second, this class contains fewer training instances than the other two. There are only 395 instances in OTH, and 1,075 in GRP, and 2,407 in IND.\nConclusion and Future Work\nThis paper presents OLID, a new dataset with annotation of type and target of offensive language. OLID is the official dataset of the shared task SemEval 2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval) BIBREF16 . In OffensEval, each annotation level in OLID is an independent sub-task. The dataset contains 14,100 tweets and is released freely to the research community. To the best of our knowledge, this is the first dataset to contain annotation of type and target of offenses in social media and it opens several new avenues for research in this area. We present baseline experiments using SVMs and neural networks to identify the offensive tweets, discriminate between insults, threats, and profanity, and finally to identify the target of the offensive messages. The results show that this is a challenging task. A CNN-based sentence classifier achieved the best results in all three sub-tasks.\nIn future work, we would like to make a cross-corpus comparison of OLID and datasets annotated for similar tasks such as aggression identification BIBREF2 and hate speech detection BIBREF8 . This comparison is, however, far from trivial as the annotation of OLID is different.\nAcknowledgments\nThe research presented in this paper was partially supported by an ERAS fellowship awarded by the University of Wolverhampton.", "source": "scientific_qa", "evaluation": "f1"}
{"instructions": ["What is the approach of previous work?", "Is the lexicon the same for all languages?", "How do they obtain the lexicon?", "What evaluation metric is used?", "Which languages are similar to each other?", "Which datasets are employed for South African languages LID?", "Does the paper report the performance of a baseline model on South African languages LID?", "Does the algorithm improve on the state-of-the-art methods?"], "outputs": ["'shallow' naive Bayes, SVM, hierarchical stacked classifiers, bidirectional recurrent neural networks", "Yes", "built over all the data and therefore includes the vocabulary from both the training and testing sets", "average classification accuracy, execution performance", "Nguni languages (zul, xho, nbl, ssw), Sotho languages (nso, sot, tsn)", "DSL 2015, DSL 2017, JW300 parallel corpus , NCHLT text corpora", "Yes", "Yes"], "input": "Introduction\nAccurate language identification (LID) is the first step in many natural language processing and machine comprehension pipelines. If the language of a piece of text is known then the appropriate downstream models like parts of speech taggers and language models can be applied as required.\nLID is further also an important step in harvesting scarce language resources. Harvested data can be used to bootstrap more accurate LID models and in doing so continually improve the quality of the harvested data. Availability of data is still one of the big roadblocks for applying data driven approaches like supervised machine learning in developing countries.\nHaving 11 official languages of South Africa has lead to initiatives (discussed in the next section) that have had positive effect on the availability of language resources for research. However, many of the South African languages are still under resourced from the point of view of building data driven models for machine comprehension and process automation.\nTable TABREF2 shows the percentages of first language speakers for each of the official languages of South Africa. These are four conjunctively written Nguni languages (zul, xho, nbl, ssw), Afrikaans (afr) and English (eng), three disjunctively written Sotho languages (nso, sot, tsn), as well as tshiVenda (ven) and Xitsonga (tso). The Nguni languages are similar to each other and harder to distinguish. The same is true of the Sotho languages.\nThis paper presents a hierarchical naive Bayesian and lexicon based classifier for LID of short pieces of text of 15-20 characters long. The algorithm is evaluated against recent approaches using existing test sets from previous works on South African languages as well as the Discriminating between Similar Languages (DSL) 2015 and 2017 shared tasks.\nSection SECREF2 reviews existing works on the topic and summarises the remaining research problems. Section SECREF3 of the paper discusses the proposed algorithm and Section SECREF4 presents comparative results.\nRelated Works\nThe focus of this section is on recently published datasets and LID research applicable to the South African context. An in depth survey of algorithms, features, datasets, shared tasks and evaluation methods may be found in BIBREF0.\nThe datasets for the DSL 2015 & DSL 2017 shared tasks BIBREF1 are often used in LID benchmarks and also available on Kaggle . The DSL datasets, like other LID datasets, consists of text sentences labelled by language. The 2017 dataset, for example, contains 14 languages over 6 language groups with 18000 training samples and 1000 testing samples per language.\nThe recently published JW300 parallel corpus BIBREF2 covers over 300 languages with around 100 thousand parallel sentences per language pair on average. In South Africa, a multilingual corpus of academic texts produced by university students with different mother tongues is being developed BIBREF3. The WiLI-2018 benchmark dataset BIBREF4 for monolingual written natural language identification includes around 1000 paragraphs of 235 languages. A possibly useful link can also be made BIBREF5 between Native Language Identification (NLI) (determining the native language of the author of a text) and Language Variety Identification (LVI) (classification of different varieties of a single language) which opens up more datasets. The Leipzig Corpora Collection BIBREF6, the Universal Declaration of Human Rights and Tatoeba are also often used sources of data.\nThe NCHLT text corpora BIBREF7 is likely a good starting point for a shared LID task dataset for the South African languages BIBREF8. The NCHLT text corpora contains enough data to have 3500 training samples and 600 testing samples of 300+ character sentences per language. Researchers have recently started applying existing algorithms for tasks like neural machine translation in earnest to such South African language datasets BIBREF9.\nExisting NLP datasets, models and services BIBREF10 are available for South African languages. These include an LID algorithm BIBREF11 that uses a character level n-gram language model. Multiple papers have shown that 'shallow' naive Bayes classifiers BIBREF12, BIBREF8, BIBREF13, BIBREF14, SVMs BIBREF15 and similar models work very well for doing LID. The DSL 2017 paper BIBREF1, for example, gives an overview of the solutions of all of the teams that competed on the shared task and the winning approach BIBREF16 used an SVM with character n-gram, parts of speech tag features and some other engineered features. The winning approach for DSL 2015 used an ensemble naive Bayes classifier. The fasttext classifier BIBREF17 is perhaps one of the best known efficient 'shallow' text classifiers that have been used for LID .\nMultiple papers have proposed hierarchical stacked classifiers (including lexicons) that would for example first classify a piece of text by language group and then by exact language BIBREF18, BIBREF19, BIBREF8, BIBREF0. Some work has also been done on classifying surnames between Tshivenda, Xitsonga and Sepedi BIBREF20. Additionally, data augmentation BIBREF21 and adversarial training BIBREF22 approaches are potentially very useful to reduce the requirement for data.\nResearchers have investigated deeper LID models like bidirectional recurrent neural networks BIBREF23 or ensembles of recurrent neural networks BIBREF24. The latter is reported to achieve 95.12% in the DSL 2015 shared task. In these models text features can include character and word n-grams as well as informative character and word-level features learnt BIBREF25 from the training data. The neural methods seem to work well in tasks where more training data is available.\nIn summary, LID of short texts, informal styles and similar languages remains a difficult problem which is actively being researched. Increased confusion can in general be expected between shorter pieces of text and languages that are more closely related. Shallow methods still seem to work well compared to deeper models for LID. Other remaining research opportunities seem to be data harvesting, building standardised datasets and creating shared tasks for South Africa and Africa. Support for language codes that include more languages seems to be growing and discoverability of research is improving with more survey papers coming out. Paywalls also seem to no longer be a problem; the references used in this paper was either openly published or available as preprint papers.\nMethodology\nThe proposed LID algorithm builds on the work in BIBREF8 and BIBREF26. We apply a naive Bayesian classifier with character (2, 4 & 6)-grams, word unigram and word bigram features with a hierarchical lexicon based classifier.\nThe naive Bayesian classifier is trained to predict the specific language label of a piece of text, but used to first classify text as belonging to either the Nguni family, the Sotho family, English, Afrikaans, Xitsonga or Tshivenda. The scikit-learn multinomial naive Bayes classifier is used for the implementation with an alpha smoothing value of 0.01 and hashed text features.\nThe lexicon based classifier is then used to predict the specific language within a language group. For the South African languages this is done for the Nguni and Sotho groups. If the lexicon prediction of the specific language has high confidence then its result is used as the final label else the naive Bayesian classifier's specific language prediction is used as the final result. The lexicon is built over all the data and therefore includes the vocabulary from both the training and testing sets.\nThe lexicon based classifier is designed to trade higher precision for lower recall. The proposed implementation is considered confident if the number of words from the winning language is at least one more than the number of words considered to be from the language scored in second place.\nThe stacked classifier is tested against three public LID implementations BIBREF17, BIBREF23, BIBREF8. The LID implementation described in BIBREF17 is available on GitHub and is trained and tested according to a post on the fasttext blog. Character (5-6)-gram features with 16 dimensional vectors worked the best. The implementation discussed in BIBREF23 is available from https://github.com/tomkocmi/LanideNN. Following the instructions for an OSX pip install of an old r0.8 release of TensorFlow, the LanideNN code could be executed in Python 3.7.4. Settings were left at their defaults and a learning rate of 0.001 was used followed by a refinement with learning rate of 0.0001. Only one code modification was applied to return the results from a method that previously just printed to screen. The LID algorithm described in BIBREF8 is also available on GitHub.\nThe stacked classifier is also tested against the results reported for four other algorithms BIBREF16, BIBREF26, BIBREF24, BIBREF15. All the comparisons are done using the NCHLT BIBREF7, DSL 2015 BIBREF19 and DSL 2017 BIBREF1 datasets discussed in Section SECREF2.\nResults and Analysis\nThe average classification accuracy results are summarised in Table TABREF9. The accuracies reported are for classifying a piece of text by its specific language label. Classifying text only by language group or family is a much easier task as reported in BIBREF8.\nDifferent variations of the proposed classifier were evaluated. A single NB classifier (NB), a stack of two NB classifiers (NB+NB), a stack of a NB classifier and lexicon (NB+Lex) and a lexicon (Lex) by itself. A lexicon with a 50% training token dropout is also listed to show the impact of the lexicon support on the accuracy.\nFrom the results it seems that the DSL 2017 task might be harder than the DSL 2015 and NCHLT tasks. Also, the results for the implementation discussed in BIBREF23 might seem low, but the results reported in that paper is generated on longer pieces of text so lower scores on the shorter pieces of text derived from the NCHLT corpora is expected.\nThe accuracy of the proposed algorithm seems to be dependent on the support of the lexicon. Without a good lexicon a non-stacked naive Bayesian classifier might even perform better.\nThe execution performance of some of the LID implementations are shown in Table TABREF10. Results were generated on an early 2015 13-inch Retina MacBook Pro with a 2.9 GHz CPU (Turbo Boosted to 3.4 GHz) and 8GB RAM. The C++ implementation in BIBREF17 is the fastest. The implementation in BIBREF8 makes use of un-hashed feature representations which causes it to be slower than the proposed sklearn implementation. The execution performance of BIBREF23 might improve by a factor of five to ten when executed on a GPU.\nConclusion\nLID of short texts, informal styles and similar languages remains a difficult problem which is actively being researched. The proposed algorithm was evaluated on three existing datasets and compared to the implementations of three public LID implementations as well as to reported results of four other algorithms. It performed well relative to the other methods beating their results. However, the performance is dependent on the support of the lexicon.\nWe would like to investigate the value of a lexicon in a production system and how to possibly maintain it using self-supervised learning. We are investigating the application of deeper language models some of which have been used in more recent DSL shared tasks. We would also like to investigate data augmentation strategies to reduce the amount of training data that is required.\nFurther research opportunities include data harvesting, building standardised datasets and shared tasks for South Africa as well as the rest of Africa. In general, the support for language codes that include more languages seems to be growing, discoverability of research is improving and paywalls seem to no longer be a big problem in getting access to published research.", "source": "scientific_qa", "evaluation": "f1"}
{"instructions": ["Do they report results only on English data?", "Do they report results only on English data?", "What evidence do the authors present that the model can capture some biases in data annotation and collection?", "Which publicly available datasets are used?", "What baseline is used?", "What new fine-tuning methods are presented?", "What are the existing biases?", "What biases does their model capture?"], "outputs": ["Yes", "The authors showed few tweets where neither and implicit hatred content exist but the model was able to discriminate", "Waseem-dataset, Davidson-dataset,", "Waseem and Hovy BIBREF5, Davidson et al. BIBREF9, and Waseem et al. BIBREF10", "BERT based fine-tuning, Insert nonlinear layers, Insert Bi-LSTM layer, Insert CNN layer", "sampling tweets from specific keywords create systematic and substancial racial biases in datasets", "Data annotation biases where tweet containing disrespectful words are annotated as hate or offensive without any presumption about the social context of tweeters", "Waseem and Hovy BIBREF5, Davidson et al. BIBREF9, and Waseem et al. BIBREF10"], "input": "Introduction\nPeople are increasingly using social networking platforms such as Twitter, Facebook, YouTube, etc. to communicate their opinions and share information. Although the interactions among users on these platforms can lead to constructive conversations, they have been increasingly exploited for the propagation of abusive language and the organization of hate-based activities BIBREF0, BIBREF1, especially due to the mobility and anonymous environment of these online platforms. Violence attributed to online hate speech has increased worldwide. For example, in the UK, there has been a significant increase in hate speech towards the immigrant and Muslim communities following the UK's leaving the EU and the Manchester and London attacks. The US also has been a marked increase in hate speech and related crime following the Trump election. Therefore, governments and social network platforms confronting the trend must have tools to detect aggressive behavior in general, and hate speech in particular, as these forms of online aggression not only poison the social climate of the online communities that experience it, but can also provoke physical violence and serious harm BIBREF1.\nRecently, the problem of online abusive detection has attracted scientific attention. Proof of this is the creation of the third Workshop on Abusive Language Online or Kaggle\u2019s Toxic Comment Classification Challenge that gathered 4,551 teams in 2018 to detect different types of toxicities (threats, obscenity, etc.). In the scope of this work, we mainly focus on the term hate speech as abusive content in social media, since it can be considered a broad umbrella term for numerous kinds of insulting user-generated content. Hate speech is commonly defined as any communication criticizing a person or a group based on some characteristics such as gender, sexual orientation, nationality, religion, race, etc. Hate speech detection is not a stable or simple target because misclassification of regular conversation as hate speech can severely affect users\u2019 freedom of expression and reputation, while misclassification of hateful conversations as unproblematic would maintain the status of online communities as unsafe environments BIBREF2.\nTo detect online hate speech, a large number of scientific studies have been dedicated by using Natural Language Processing (NLP) in combination with Machine Learning (ML) and Deep Learning (DL) methods BIBREF3, BIBREF4, BIBREF5, BIBREF6, BIBREF7, BIBREF0. Although supervised machine learning-based approaches have used different text mining-based features such as surface features, sentiment analysis, lexical resources, linguistic features, knowledge-based features or user-based and platform-based metadata BIBREF8, BIBREF9, BIBREF10, they necessitate a well-defined feature extraction approach. The trend now seems to be changing direction, with deep learning models being used for both feature extraction and the training of classifiers. These newer models are applying deep learning approaches such as Convolutional Neural Networks (CNNs), Long Short-Term Memory Networks (LSTMs), etc.BIBREF6, BIBREF0 to enhance the performance of hate speech detection models, however, they still suffer from lack of labelled data or inability to improve generalization property.\nHere, we propose a transfer learning approach for hate speech understanding using a combination of the unsupervised pre-trained model BERT BIBREF11 and some new supervised fine-tuning strategies. As far as we know, it is the first time that such exhaustive fine-tuning strategies are proposed along with a generative pre-trained language model to transfer learning to low-resource hate speech languages and improve performance of the task. In summary:\nWe propose a transfer learning approach using the pre-trained language model BERT learned on English Wikipedia and BookCorpus to enhance hate speech detection on publicly available benchmark datasets. Toward that end, for the first time, we introduce new fine-tuning strategies to examine the effect of different embedding layers of BERT in hate speech detection.\nOur experiment results show that using the pre-trained BERT model and fine-tuning it on the downstream task by leveraging syntactical and contextual information of all BERT's transformers outperforms previous works in terms of precision, recall, and F1-score. Furthermore, examining the results shows the ability of our model to detect some biases in the process of collecting or annotating datasets. It can be a valuable clue in using pre-trained BERT model for debiasing hate speech datasets in future studies.\nPrevious Works\nHere, the existing body of knowledge on online hate speech and offensive language and transfer learning is presented.\nOnline Hate Speech and Offensive Language: Researchers have been studying hate speech on social media platforms such as Twitter BIBREF9, Reddit BIBREF12, BIBREF13, and YouTube BIBREF14 in the past few years. The features used in traditional machine learning approaches are the main aspects distinguishing different methods, and surface-level features such as bag of words, word-level and character-level $n$-grams, etc. have proven to be the most predictive features BIBREF3, BIBREF4, BIBREF5. Apart from features, different algorithms such as Support Vector Machines BIBREF15, Naive Baye BIBREF1, and Logistic Regression BIBREF5, BIBREF9, etc. have been applied for classification purposes. Waseem et al. BIBREF5 provided a test with a list of criteria based on the work in Gender Studies and Critical Race Theory (CRT) that can annotate a corpus of more than $16k$ tweets as racism, sexism, or neither. To classify tweets, they used a logistic regression model with different sets of features, such as word and character $n$-grams up to 4, gender, length, and location. They found that their best model produces character $n$-gram as the most indicative features, and using location or length is detrimental. Davidson et al. BIBREF9 collected a $24K$ corpus of tweets containing hate speech keywords and labelled the corpus as hate speech, offensive language, or neither by using crowd-sourcing and extracted different features such as $n$-grams, some tweet-level metadata such as the number of hashtags, mentions, retweets, and URLs, Part Of Speech (POS) tagging, etc. Their experiments on different multi-class classifiers showed that the Logistic Regression with L2 regularization performs the best at this task. Malmasi et al. BIBREF15 proposed an ensemble-based system that uses some linear SVM classifiers in parallel to distinguish hate speech from general profanity in social media.\nAs one of the first attempts in neural network models, Djuric et al. BIBREF16 proposed a two-step method including a continuous bag of words model to extract paragraph2vec embeddings and a binary classifier trained along with the embeddings to distinguish between hate speech and clean content. Badjatiya et al. BIBREF0 investigated three deep learning architectures, FastText, CNN, and LSTM, in which they initialized the word embeddings with either random or GloVe embeddings. Gamb\u00e4ck et al. BIBREF6 proposed a hate speech classifier based on CNN model trained on different feature embeddings such as word embeddings and character $n$-grams. Zhang et al. BIBREF7 used a CNN+GRU (Gated Recurrent Unit network) neural network model initialized with pre-trained word2vec embeddings to capture both word/character combinations (e. g., $n$-grams, phrases) and word/character dependencies (order information). Waseem et al. BIBREF10 brought a new insight to hate speech and abusive language detection tasks by proposing a multi-task learning framework to deal with datasets across different annotation schemes, labels, or geographic and cultural influences from data sampling. Founta et al. BIBREF17 built a unified classification model that can efficiently handle different types of abusive language such as cyberbullying, hate, sarcasm, etc. using raw text and domain-specific metadata from Twitter. Furthermore, researchers have recently focused on the bias derived from the hate speech training datasets BIBREF18, BIBREF2, BIBREF19. Davidson et al. BIBREF2 showed that there were systematic and substantial racial biases in five benchmark Twitter datasets annotated for offensive language detection. Wiegand et al. BIBREF19 also found that classifiers trained on datasets containing more implicit abuse (tweets with some abusive words) are more affected by biases rather than once trained on datasets with a high proportion of explicit abuse samples (tweets containing sarcasm, jokes, etc.).\nTransfer Learning: Pre-trained vector representations of words, embeddings, extracted from vast amounts of text data have been encountered in almost every language-based tasks with promising results. Two of the most frequently used context-independent neural embeddings are word2vec and Glove extracted from shallow neural networks. The year 2018 has been an inflection point for different NLP tasks thanks to remarkable breakthroughs: Universal Language Model Fine-Tuning (ULMFiT) BIBREF20, Embedding from Language Models (ELMO) BIBREF21, OpenAI\u2019 s Generative Pre-trained Transformer (GPT) BIBREF22, and Google\u2019s BERT model BIBREF11. Howard et al. BIBREF20 proposed ULMFiT which can be applied to any NLP task by pre-training a universal language model on a general-domain corpus and then fine-tuning the model on target task data using discriminative fine-tuning. Peters et al. BIBREF21 used a bi-directional LSTM trained on a specific task to present context-sensitive representations of words in word embeddings by looking at the entire sentence. Radford et al. BIBREF22 and Devlin et al. BIBREF11 generated two transformer-based language models, OpenAI GPT and BERT respectively. OpenAI GPT BIBREF22 is an unidirectional language model while BERT BIBREF11 is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus. BERT has two novel prediction tasks: Masked LM and Next Sentence Prediction. The pre-trained BERT model significantly outperformed ELMo and OpenAI GPT in a series of downstream tasks in NLP BIBREF11. Identifying hate speech and offensive language is a complicated task due to the lack of undisputed labelled data BIBREF15 and the inability of surface features to capture the subtle semantics in text. To address this issue, we use the pre-trained language model BERT for hate speech classification and try to fine-tune specific task by leveraging information from different transformer encoders.\nMethodology\nHere, we analyze the BERT transformer model on the hate speech detection task. BERT is a multi-layer bidirectional transformer encoder trained on the English Wikipedia and the Book Corpus containing 2,500M and 800M tokens, respectively, and has two models named BERTbase and BERTlarge. BERTbase contains an encoder with 12 layers (transformer blocks), 12 self-attention heads, and 110 million parameters whereas BERTlarge has 24 layers, 16 attention heads, and 340 million parameters. Extracted embeddings from BERTbase have 768 hidden dimensions BIBREF11. As the BERT model is pre-trained on general corpora, and for our hate speech detection task we are dealing with social media content, therefore as a crucial step, we have to analyze the contextual information extracted from BERT' s pre-trained layers and then fine-tune it using annotated datasets. By fine-tuning we update weights using a labelled dataset that is new to an already trained model. As an input and output, BERT takes a sequence of tokens in maximum length 512 and produces a representation of the sequence in a 768-dimensional vector. BERT inserts at most two segments to each input sequence, [CLS] and [SEP]. [CLS] embedding is the first token of the input sequence and contains the special classification embedding which we take the first token [CLS] in the final hidden layer as the representation of the whole sequence in hate speech classification task. The [SEP] separates segments and we will not use it in our classification task. To perform the hate speech detection task, we use BERTbase model to classify each tweet as Racism, Sexism, Neither or Hate, Offensive, Neither in our datasets. In order to do that, we focus on fine-tuning the pre-trained BERTbase parameters. By fine-tuning, we mean training a classifier with different layers of 768 dimensions on top of the pre-trained BERTbase transformer to minimize task-specific parameters.\nMethodology ::: Fine-Tuning Strategies\nDifferent layers of a neural network can capture different levels of syntactic and semantic information. The lower layer of the BERT model may contain more general information whereas the higher layers contain task-specific information BIBREF11, and we can fine-tune them with different learning rates. Here, four different fine-tuning approaches are implemented that exploit pre-trained BERTbase transformer encoders for our classification task. More information about these transformer encoders' architectures are presented in BIBREF11. In the fine-tuning phase, the model is initialized with the pre-trained parameters and then are fine-tuned using the labelled datasets. Different fine-tuning approaches on the hate speech detection task are depicted in Figure FIGREF8, in which $X_{i}$ is the vector representation of token $i$ in a tweet sample, and are explained in more detail as follows:\n1. BERT based fine-tuning: In the first approach, which is shown in Figure FIGREF8, very few changes are applied to the BERTbase. In this architecture, only the [CLS] token output provided by BERT is used. The [CLS] output, which is equivalent to the [CLS] token output of the 12th transformer encoder, a vector of size 768, is given as input to a fully connected network without hidden layer. The softmax activation function is applied to the hidden layer to classify.\n2. Insert nonlinear layers: Here, the first architecture is upgraded and an architecture with a more robust classifier is provided in which instead of using a fully connected network without hidden layer, a fully connected network with two hidden layers in size 768 is used. The first two layers use the Leaky Relu activation function with negative slope = 0.01, but the final layer, as the first architecture, uses softmax activation function as shown in Figure FIGREF8.\n3. Insert Bi-LSTM layer: Unlike previous architectures that only use [CLS] as the input for the classifier, in this architecture all outputs of the latest transformer encoder are used in such a way that they are given as inputs to a bidirectional recurrent neural network (Bi-LSTM) as shown in Figure FIGREF8. After processing the input, the network sends the final hidden state to a fully connected network that performs classification using the softmax activation function.\n4. Insert CNN layer: In this architecture shown in Figure FIGREF8, the outputs of all transformer encoders are used instead of using the output of the latest transformer encoder. So that the output vectors of each transformer encoder are concatenated, and a matrix is produced. The convolutional operation is performed with a window of size (3, hidden size of BERT which is 768 in BERTbase model) and the maximum value is generated for each transformer encoder by applying max pooling on the convolution output. By concatenating these values, a vector is generated which is given as input to a fully connected network. By applying softmax on the input, the classification operation is performed.\nExperiments and Results\nWe first introduce datasets used in our study and then investigate the different fine-tuning strategies for hate speech detection task. We also include the details of our implementation and error analysis in the respective subsections.\nExperiments and Results ::: Dataset Description\nWe evaluate our method on two widely-studied datasets provided by Waseem and Hovey BIBREF5 and Davidson et al. BIBREF9. Waseem and Hovy BIBREF5 collected $16k$ of tweets based on an initial ad-hoc approach that searched common slurs and terms related to religious, sexual, gender, and ethnic minorities. They annotated their dataset manually as racism, sexism, or neither. To extend this dataset, Waseem BIBREF23 also provided another dataset containing $6.9k$ of tweets annotated with both expert and crowdsourcing users as racism, sexism, neither, or both. Since both datasets are overlapped partially and they used the same strategy in definition of hateful content, we merged these two datasets following Waseem et al. BIBREF10 to make our imbalance data a bit larger. Davidson et al. BIBREF9 used the Twitter API to accumulate 84.4 million tweets from 33,458 twitter users containing particular terms from a pre-defined lexicon of hate speech words and phrases, called Hatebased.org. To annotate collected tweets as Hate, Offensive, or Neither, they randomly sampled $25k$ tweets and asked users of CrowdFlower crowdsourcing platform to label them. In detail, the distribution of different classes in both datasets will be provided in Subsection SECREF15.\nExperiments and Results ::: Pre-Processing\nWe find mentions of users, numbers, hashtags, URLs and common emoticons and replace them with the tokens <user>,<number>,<hashtag>,<url>,<emoticon>. We also find elongated words and convert them into short and standard format; for example, converting yeeeessss to yes. With hashtags that include some tokens without any with space between them, we replace them by their textual counterparts; for example, we convert hashtag \u201c#notsexist\" to \u201cnot sexist\". All punctuation marks, unknown uni-codes and extra delimiting characters are removed, but we keep all stop words because our model trains the sequence of words in a text directly. We also convert all tweets to lower case.\nExperiments and Results ::: Implementation and Results Analysis\nFor the implementation of our neural network, we used pytorch-pretrained-bert library containing the pre-trained BERT model, text tokenizer, and pre-trained WordPiece. As the implementation environment, we use Google Colaboratory tool which is a free research tool with a Tesla K80 GPU and 12G RAM. Based on our experiments, we trained our classifier with a batch size of 32 for 3 epochs. The dropout probability is set to 0.1 for all layers. Adam optimizer is used with a learning rate of 2e-5. As an input, we tokenized each tweet with the BERT tokenizer. It contains invalid characters removal, punctuation splitting, and lowercasing the words. Based on the original BERT BIBREF11, we split words to subword units using WordPiece tokenization. As tweets are short texts, we set the maximum sequence length to 64 and in any shorter or longer length case it will be padded with zero values or truncated to the maximum length.\nWe consider 80% of each dataset as training data to update the weights in the fine-tuning phase, 10% as validation data to measure the out-of-sample performance of the model during training, and 10% as test data to measure the out-of-sample performance after training. To prevent overfitting, we use stratified sampling to select 0.8, 0.1, and 0.1 portions of tweets from each class (racism/sexism/neither or hate/offensive/neither) for train, validation, and test. Classes' distribution of train, validation, and test datasets are shown in Table TABREF16.\nAs it is understandable from Tables TABREF16(classdistributionwaseem) and TABREF16(classdistributiondavidson), we are dealing with imbalance datasets with various classes\u2019 distribution. Since hate speech and offensive languages are real phenomena, we did not perform oversampling or undersampling techniques to adjust the classes\u2019 distribution and tried to supply the datasets as realistic as possible. We evaluate the effect of different fine-tuning strategies on the performance of our model. Table TABREF17 summarized the obtained results for fine-tuning strategies along with the official baselines. We use Waseem and Hovy BIBREF5, Davidson et al. BIBREF9, and Waseem et al. BIBREF10 as baselines and compare the results with our different fine-tuning strategies using pre-trained BERTbase model. The evaluation results are reported on the test dataset and on three different metrics: precision, recall, and weighted-average F1-score. We consider weighted-average F1-score as the most robust metric versus class imbalance, which gives insight into the performance of our proposed models. According to Table TABREF17, F1-scores of all BERT based fine-tuning strategies except BERT + nonlinear classifier on top of BERT are higher than the baselines. Using the pre-trained BERT model as initial embeddings and fine-tuning the model with a fully connected linear classifier (BERTbase) outperforms previous baselines yielding F1-score of 81% and 91% for datasets of Waseem and Davidson respectively. Inserting a CNN to pre-trained BERT model for fine-tuning on downstream task provides the best results as F1- score of 88% and 92% for datasets of Waseem and Davidson and it clearly exceeds the baselines. Intuitively, this makes sense that combining all pre-trained BERT layers with a CNN yields better results in which our model uses all the information included in different layers of pre-trained BERT during the fine-tuning phase. This information contains both syntactical and contextual features coming from lower layers to higher layers of BERT.\nExperiments and Results ::: Error Analysis\nAlthough we have very interesting results in term of recall, the precision of the model shows the portion of false detection we have. To understand better this phenomenon, in this section we perform a deep analysis on the error of the model. We investigate the test datasets and their confusion matrices resulted from the BERTbase + CNN model as the best fine-tuning approach; depicted in Figures FIGREF19 and FIGREF19. According to Figure FIGREF19 for Waseem-dataset, it is obvious that the model can separate sexism from racism content properly. Only two samples belonging to racism class are misclassified as sexism and none of the sexism samples are misclassified as racism. A large majority of the errors come from misclassifying hateful categories (racism and sexism) as hatless (neither) and vice versa. 0.9% and 18.5% of all racism samples are misclassified as sexism and neither respectively whereas it is 0% and 12.7% for sexism samples. Almost 12% of neither samples are misclassified as racism or sexism. As Figure FIGREF19 makes clear for Davidson-dataset, the majority of errors are related to hate class where the model misclassified hate content as offensive in 63% of the cases. However, 2.6% and 7.9% of offensive and neither samples are misclassified respectively.\nTo understand better the mislabeled items by our model, we did a manual inspection on a subset of the data and record some of them in Tables TABREF20 and TABREF21. Considering the words such as \u201cdaughters\", \u201cwomen\", and \u201cburka\" in tweets with IDs 1 and 2 in Table TABREF20, it can be understood that our BERT based classifier is confused with the contextual semantic between these words in the samples and misclassified them as sexism because they are mainly associated to femininity. In some cases containing implicit abuse (like subtle insults) such as tweets with IDs 5 and 7, our model cannot capture the hateful/offensive content and therefore misclassifies. It should be noticed that even for a human it is difficult to discriminate against this kind of implicit abuses.\nBy examining more samples and with respect to recently studies BIBREF2, BIBREF24, BIBREF19, it is clear that many errors are due to biases from data collection BIBREF19 and rules of annotation BIBREF24 and not the classifier itself. Since Waseem et al.BIBREF5 created a small ad-hoc set of keywords and Davidson et al.BIBREF9 used a large crowdsourced dictionary of keywords (Hatebase lexicon) to sample tweets for training, they included some biases in the collected data. Especially for Davidson-dataset, some tweets with specific language (written within the African American Vernacular English) and geographic restriction (United States of America) are oversampled such as tweets containing disparage words \u201cnigga\", \u201cfaggot\", \u201ccoon\", or \u201cqueer\", result in high rates of misclassification. However, these misclassifications do not confirm the low performance of our classifier because annotators tended to annotate many samples containing disrespectful words as hate or offensive without any presumption about the social context of tweeters such as the speaker\u2019s identity or dialect, whereas they were just offensive or even neither tweets. Tweets IDs 6, 8, and 10 are some samples containing offensive words and slurs which arenot hate or offensive in all cases and writers of them used this type of language in their daily communications. Given these pieces of evidence, by considering the content of tweets, we can see in tweets IDs 3, 4, and 9 that our BERT-based classifier can discriminate tweets in which neither and implicit hatred content exist. One explanation of this observation may be the pre-trained general knowledge that exists in our model. Since the pre-trained BERT model is trained on general corpora, it has learned general knowledge from normal textual data without any purposely hateful or offensive language. Therefore, despite the bias in the data, our model can differentiate hate and offensive samples accurately by leveraging knowledge-aware language understanding that it has and it can be the main reason for high misclassifications of hate samples as offensive (in reality they are more similar to offensive rather than hate by considering social context, geolocation, and dialect of tweeters).\nConclusion\nConflating hatred content with offensive or harmless language causes online automatic hate speech detection tools to flag user-generated content incorrectly. Not addressing this problem may bring about severe negative consequences for both platforms and users such as decreasement of platforms' reputation or users abandonment. Here, we propose a transfer learning approach advantaging the pre-trained language model BERT to enhance the performance of a hate speech detection system and to generalize it to new datasets. To that end, we introduce new fine-tuning strategies to examine the effect of different layers of BERT in hate speech detection task. The evaluation results indicate that our model outperforms previous works by profiting the syntactical and contextual information embedded in different transformer encoder layers of the BERT model using a CNN-based fine-tuning strategy. Furthermore, examining the results shows the ability of our model to detect some biases in the process of collecting or annotating datasets. It can be a valuable clue in using the pre-trained BERT model to alleviate bias in hate speech datasets in future studies, by investigating a mixture of contextual information embedded in the BERT\u2019s layers and a set of features associated to the different type of biases in data.", "source": "scientific_qa", "evaluation": "f1"}
{"instructions": ["What does the \"sensitivity\" quantity denote?", "What end tasks do they evaluate on?", "What is a semicharacter architecture?", "Do they experiment with offering multiple candidate corrections and voting on the model output, since this seems highly likely to outperform a one-best correction?", "Why is the adversarial setting appropriate for misspelling recognition?", "How do the backoff strategies work?"], "outputs": ["the number of distinct word recognition outputs that an attacker can induce", "Sentiment analysis and paraphrase detection under adversarial attacks", "A semi-character based RNN (ScRNN) treats the first and last characters individually, and is agnostic to the ordering of the internal characters", "No", "Adversarial misspellings are a real-world problem", "Pass-through passes the possibly misspelled word as is, backoff to neutral word backs off to a word with similar distribution across classes and backoff to background model backs off to a more generic word recognition model trained with larger and less specialized corpus."], "input": "Introduction\nDespite the rapid progress of deep learning techniques on diverse supervised learning tasks, these models remain brittle to subtle shifts in the data distribution. Even when the permissible changes are confined to barely-perceptible perturbations, training robust models remains an open challenge. Following the discovery that imperceptible attacks could cause image recognition models to misclassify examples BIBREF0 , a veritable sub-field has emerged in which authors iteratively propose attacks and countermeasures.\nFor all the interest in adversarial computer vision, these attacks are rarely encountered outside of academic research. However, adversarial misspellings constitute a longstanding real-world problem. Spammers continually bombard email servers, subtly misspelling words in efforts to evade spam detection while preserving the emails' intended meaning BIBREF1 , BIBREF2 . As another example, programmatic censorship on the Internet has spurred communities to adopt similar methods to communicate surreptitiously BIBREF3 .\nIn this paper, we focus on adversarially-chosen spelling mistakes in the context of text classification, addressing the following attack types: dropping, adding, and swapping internal characters within words. These perturbations are inspired by psycholinguistic studies BIBREF4 , BIBREF5 which demonstrated that humans can comprehend text altered by jumbling internal characters, provided that the first and last characters of each word remain unperturbed.\nFirst, in experiments addressing both BiLSTM and fine-tuned BERT models, comprising four different input formats: word-only, char-only, word+char, and word-piece BIBREF6 , we demonstrate that an adversary can degrade a classifier's performance to that achieved by random guessing. This requires altering just two characters per sentence. Such modifications might flip words either to a different word in the vocabulary or, more often, to the out-of-vocabulary token UNK. Consequently, adversarial edits can degrade a word-level model by transforming the informative words to UNK. Intuitively, one might suspect that word-piece and character-level models would be less susceptible to spelling attacks as they can make use of the residual word context. However, our experiments demonstrate that character and word-piece models are in fact more vulnerable. We show that this is due to the adversary's effective capacity for finer grained manipulations on these models. While against a word-level model, the adversary is mostly limited to UNK-ing words, against a word-piece or character-level model, each character-level add, drop, or swap produces a distinct input, providing the adversary with a greater set of options.\nSecond, we evaluate first-line techniques including data augmentation and adversarial training, demonstrating that they offer only marginal benefits here, e.g., a BERT model achieving $90.3$ accuracy on a sentiment classification task, is degraded to $64.1$ by an adversarially-chosen 1-character swap in the sentence, which can only be restored to $69.2$ by adversarial training.\nThird (our primary contribution), we propose a task-agnostic defense, attaching a word recognition model that predicts each word in a sentence given a full sequence of (possibly misspelled) inputs. The word recognition model's outputs form the input to a downstream classification model. Our word recognition models build upon the RNN-based semi-character word recognition model due to BIBREF7 . While our word recognizers are trained on domain-specific text from the task at hand, they often predict UNK at test time, owing to the small domain-specific vocabulary. To handle unobserved and rare words, we propose several backoff strategies including falling back on a generic word recognizer trained on a larger corpus. Incorporating our defenses, BERT models subject to 1-character attacks are restored to $88.3$ , $81.1$ , $78.0$ accuracy for swap, drop, add attacks respectively, as compared to $69.2$ , $63.6$ , and $50.0$ for adversarial training\nFourth, we offer a detailed qualitative analysis, demonstrating that a low word error rate alone is insufficient for a word recognizer to confer robustness on the downstream task. Additionally, we find that it is important that the recognition model supply few degrees of freedom to an attacker. We provide a metric to quantify this notion of sensitivity in word recognition models and study its relation to robustness empirically. Models with low sensitivity and word error rate are most robust.\nRelated Work\nSeveral papers address adversarial attacks on NLP systems. Changes to text, whether word- or character-level, are all perceptible, raising some questions about what should rightly be considered an adversarial example BIBREF8 , BIBREF9 . BIBREF10 address the reading comprehension task, showing that by appending distractor sentences to the end of stories from the SQuAD dataset BIBREF11 , they could cause models to output incorrect answers. Inspired by this work, BIBREF12 demonstrate an attack that breaks entailment systems by replacing a single word with either a synonym or its hypernym. Recently, BIBREF13 investigated the problem of producing natural-seeming adversarial examples, noting that adversarial examples in NLP are often ungrammatical BIBREF14 .\nIn related work on character-level attacks, BIBREF8 , BIBREF15 explored gradient-based methods to generate string edits to fool classification and translation systems, respectively. While their focus is on efficient methods for generating adversaries, ours is on improving the worst case adversarial performance. Similarly, BIBREF9 studied how synthetic and natural noise affects character-level machine translation. They considered structure invariant representations and adversarial training as defenses against such noise. Here, we show that an auxiliary word recognition model, which can be trained on unlabeled data, provides a strong defense.\nSpelling correction BIBREF16 is often viewed as a sub-task of grammatical error correction BIBREF17 , BIBREF18 . Classic methods rely on a source language model and a noisy channel model to find the most likely correction for a given word BIBREF19 , BIBREF20 . Recently, neural techniques have been applied to the task BIBREF7 , BIBREF21 , which model the context and orthography of the input together. Our work extends the ScRNN model of BIBREF7 .\nRobust Word Recognition\nTo tackle character-level adversarial attacks, we introduce a simple two-stage solution, placing a word recognition model ( $W$ ) before the downstream classifier ( $C$ ). Under this scheme, all inputs are classified by the composed model $C \\circ W$ . This modular approach, with $W$ and $C$ trained separately, offers several benefits: (i) we can deploy the same word recognition model for multiple downstream classification tasks/models; and (ii) we can train the word recognition model with larger unlabeled corpora.\nAgainst adversarial mistakes, two important factors govern the robustness of this combined model: $W$ 's accuracy in recognizing misspelled words and $W$ 's sensitivity to adversarial perturbations on the same input. We discuss these aspects in detail below.\nScRNN with Backoff\nWe now describe semi-character RNNs for word recognition, explain their limitations, and suggest techniques to improve them.\nInspired by the psycholinguistic studies BIBREF5 , BIBREF4 , BIBREF7 proposed a semi-character based RNN (ScRNN) that processes a sentence of words with misspelled characters, predicting the correct words at each step. Let $s = \\lbrace w_1, w_2, \\dots , w_n\\rbrace $ denote the input sentence, a sequence of constituent words $w_i$ . Each input word ( $w_i$ ) is represented by concatenating (i) a one hot vector of the first character ( $\\mathbf {w_{i1}}$ ); (ii) a one hot representation of the last character ( $\\mathbf {w_{il}}$ , where $l$ is the length of word $w_i$ ); and (iii) a bag of characters representation of the internal characters ( $\\sum _{j=2}^{l-1}\\mathbf {w_{ij}})$ . ScRNN treats the first and the last characters individually, and is agnostic to the ordering of the internal characters. Each word, represented accordingly, is then fed into a BiLSTM cell. At each sequence step, the training target is the correct corresponding word (output dimension equal to vocabulary size), and the model is optimized with cross-entropy loss.\nWhile BIBREF7 demonstrate strong word recognition performance, a drawback of their evaluation setup is that they only attack and evaluate on the subset of words that are a part of their training vocabulary. In such a setting, the word recognition performance is unreasonably dependent on the chosen vocabulary size. In principle, one can design models to predict (correctly) only a few chosen words, and ignore the remaining majority and still reach 100% accuracy. For the adversarial setting, rare and unseen words in the wild are particularly critical, as they provide opportunities for the attackers. A reliable word-recognizer should handle these cases gracefully. Below, we explore different ways to back off when the ScRNN predicts UNK (a frequent outcome for rare and unseen words):\nPass-through: word-recognizer passes on the (possibly misspelled) word as is.\nBackoff to neutral word: Alternatively, noting that passing $\\colorbox {gray!20}{\\texttt {UNK}}$ -predicted words through unchanged exposes the downstream model to potentially corrupted text, we consider backing off to a neutral word like `a', which has a similar distribution across classes.\nBackoff to background model: We also consider falling back upon a more generic word recognition model trained upon a larger, less-specialized corpus whenever the foreground word recognition model predicts UNK. Figure 1 depicts this scenario pictorially.\nEmpirically, we find that the background model (by itself) is less accurate, because of the large number of words it is trained to predict. Thus, it is best to train a precise foreground model on an in-domain corpus and focus on frequent words, and then to resort to a general-purpose background model for rare and unobserved words. Next, we delineate our second consideration for building robust word-recognizers.\nModel Sensitivity\nIn computer vision, an important factor determining the success of an adversary is the norm constraint on the perturbations allowed to an image ( $|| \\bf x - \\bf x^{\\prime }||_{\\infty } < \\epsilon $ ). Higher values of $\\epsilon $ lead to a higher chance of mis-classification for at least one $\\bf x^{\\prime }$ . Defense methods such as quantization BIBREF22 and thermometer encoding BIBREF23 try to reduce the space of perturbations available to the adversary by making the model invariant to small changes in the input.\nIn NLP, we often get such invariance for free, e.g., for a word-level model, most of the perturbations produced by our character-level adversary lead to an UNK at its input. If the model is robust to the presence of these UNK tokens, there is little room for an adversary to manipulate it. Character-level models, on the other hand, despite their superior performance in many tasks, do not enjoy such invariance. This characteristic invariance could be exploited by an attacker. Thus, to limit the number of different inputs to the classifier, we wish to reduce the number of distinct word recognition outputs that an attacker can induce, not just the number of words on which the model is \u201cfooled\u201d. We denote this property of a model as its sensitivity.\nWe can quantify this notion for a word recognition system $W$ as the expected number of unique outputs it assigns to a set of adversarial perturbations. Given a sentence $s$ from the set of sentences $\\mathcal {S}$ , let $A(s) = {s_1}^{\\prime } , {s_2}^{\\prime }, \\dots , {s_n}^{\\prime }$ denote the set of $n$ perturbations to it under attack type $A$ , and let $V$ be the function that maps strings to an input representation for the downstream classifier. For a word level model, $V$ would transform sentences to a sequence of word ids, mapping OOV words to the same UNK ID. Whereas, for a char (or word+char, word-piece) model, $V$ would map inputs to a sequence of character IDs. Formally, sensitivity is defined as\n$$S_{W,V}^A=\\mathbb {E}_{s}\\left[\\frac{\\#_{u}(V \\circ W({s_1}^{\\prime }), \\dots , V \\circ W({s_n}^{\\prime }))}{n}\\right] ,$$   (Eq. 12)\nwhere $V \\circ W (s_i)$ returns the input representation (of the downstream classifier) for the output string produced by the word-recognizer $W$ using $s_i$ and $\\#_{u}(\\cdot )$ counts the number of unique arguments.\nIntuitively, we expect a high value of $S_{W, V}^A$ to lead to a lower robustness of the downstream classifier, since the adversary has more degrees of freedom to attack the classifier. Thus, when using word recognition as a defense, it is prudent to design a low sensitivity system with a low error rate. However, as we will demonstrate, there is often a trade-off between sensitivity and error rate.\nSynthesizing Adversarial Attacks\nSuppose we are given a classifier $C: \\mathcal {S} \\rightarrow \\mathcal {Y}$ which maps natural language sentences $s \\in \\mathcal {S}$ to a label from a predefined set $y \\in \\mathcal {Y}$ . An adversary for this classifier is a function $A$ which maps a sentence $s$ to its perturbed versions $\\lbrace s^{\\prime }_1, s^{\\prime }_2, \\ldots , s^{\\prime }_{n}\\rbrace $ such that each $s^{\\prime }_i$ is close to $s$ under some notion of distance between sentences. We define the robustness of classifier $C$ to the adversary $A$ as:\n$$R_{C,A} = \\mathbb {E}_s \\left[\\min _{s^{\\prime } \\in A(s)} \\mathbb {1}[C(s^{\\prime }) = y]\\right],$$   (Eq. 14)\nwhere $y$ represents the ground truth label for $s$ . In practice, a real-world adversary may only be able to query the classifier a few times, hence $R_{C,A}$ represents the worst-case adversarial performance of $C$ . Methods for generating adversarial examples, such as HotFlip BIBREF8 , focus on efficient algorithms for searching the $\\min $ above. Improving $R_{C,A}$ would imply better robustness against all these methods.\nWe explore adversaries which perturb sentences with four types of character-level edits:\n(1) Swap: swapping two adjacent internal characters of a word. (2) Drop: removing an internal character of a word. (3) Keyboard: substituting an internal character with adjacent characters of QWERTY keyboard (4) Add: inserting a new character internally in a word. In line with the psycholinguistic studies BIBREF5 , BIBREF4 , to ensure that the perturbations do not affect human ability to comprehend the sentence, we only allow the adversary to edit the internal characters of a word, and not edit stopwords or words shorter than 4 characters.\nFor 1-character attacks, we try all possible perturbations listed above until we find an adversary that flips the model prediction. For 2-character attacks, we greedily fix the edit which had the least confidence among 1-character attacks, and then try all the allowed perturbations on the remaining words. Higher order attacks can be performed in a similar manner. The greedy strategy reduces the computation required to obtain higher order attacks, but also means that the robustness score is an upper bound on the true robustness of the classifier.\nExperiments and Results\nIn this section, we first discuss our experiments on the word recognition systems.\nWord Error Correction\nData: We evaluate the spell correctors from \u00a7 \"Robust Word Recognition\" on movie reviews from the Stanford Sentiment Treebank (SST) BIBREF24 . The SST dataset consists of 8544 movie reviews, with a vocabulary of over 16K words. As a background corpus, we use the IMDB movie reviews BIBREF25 , which contain 54K movie reviews, and a vocabulary of over 78K words. The two datasets do not share any reviews in common. The spell-correction models are evaluated on their ability to correct misspellings. The test setting consists of reviews where each word (with length $\\ge 4$ , barring stopwords) is attacked by one of the attack types (from swap, add, drop and keyboard attacks). In the all attack setting, we mix all attacks by randomly choosing one for each word. This most closely resembles a real world attack setting.\nIn addition to our word recognition models, we also compare to After The Deadline (ATD), an open-source spell corrector. We found ATD to be the best freely-available corrector. We refer the reader to BIBREF7 for comparisons of ScRNN to other anonymized commercial spell checkers.\nFor the ScRNN model, we use a single-layer Bi-LSTM with a hidden dimension size of 50. The input representation consists of 198 dimensions, which is thrice the number of unique characters (66) in the vocabulary. We cap the vocabulary size to 10K words, whereas we use the entire vocabulary of 78470 words when we backoff to the background model. For training these networks, we corrupt the movie reviews according to all attack types, i.e., applying one of the 4 attack types to each word, and trying to reconstruct the original words via cross entropy loss.\nWe calculate the word error rates (WER) of each of the models for different attacks and present our findings in Table 2 . Note that ATD incorrectly predicts $11.2$ words for every 100 words (in the `all' setting), whereas, all of the backoff variations of the ScRNN reconstruct better. The most accurate variant involves backing off to the background model, resulting in a low error rate of $6.9\\%$ , leading to the best performance on word recognition. This is a $32\\%$ relative error reduction compared to the vanilla ScRNN model with a pass-through backoff strategy. We can attribute the improved performance to the fact that there are $5.25\\%$ words in the test corpus that are unseen in the training corpus, and are thus only recoverable by backing off to a larger corpus. Notably, only training on the larger background corpus does worse, at $8.7\\%$ , since the distribution of word frequencies is different in the background corpus compared to the foreground corpus.\nRobustness to adversarial attacks\nWe use sentiment analysis and paraphrase detection as downstream tasks, as for these two tasks, 1-2 character edits do not change the output labels.\nFor sentiment classification, we systematically study the effect of character-level adversarial attacks on two architectures and four different input formats. The first architecture encodes the input sentence into a sequence of embeddings, which are then sequentially processed by a BiLSTM. The first and last states of the BiLSTM are then used by the softmax layer to predict the sentiment of the input. We consider three input formats for this architecture: (1) Word-only: where the input words are encoded using a lookup table; (2) Char-only: where the input words are encoded using a separate single-layered BiLSTM over their characters; and (3) Word $+$ Char: where the input words are encoded using a concatenation of (1) and (2) .\nThe second architecture uses the fine-tuned BERT model BIBREF26 , with an input format of word-piece tokenization. This model has recently set a new state-of-the-art on several NLP benchmarks, including the sentiment analysis task we consider here. All models are trained and evaluated on the binary version of the sentence-level Stanford Sentiment Treebank BIBREF24 dataset with only positive and negative reviews.\nWe also consider the task of paraphrase detection. Here too, we make use of the fine-tuned BERT BIBREF26 , which is trained and evaluated on the Microsoft Research Paraphrase Corpus (MRPC) BIBREF27 .\nTwo common methods for dealing with adversarial examples include: (1) data augmentation (DA) BIBREF28 ; and (2) adversarial training (Adv) BIBREF29 . In DA, the trained model is fine-tuned after augmenting the training set with an equal number of examples randomly attacked with a 1-character edit. In Adv, the trained model is fine-tuned with additional adversarial examples (selected at random) that produce incorrect predictions from the current-state classifier. The process is repeated iteratively, generating and adding newer adversarial examples from the updated classifier model, until the adversarial accuracy on dev set stops improving.\nIn Table 3 , we examine the robustness of the sentiment models under each attack and defense method. In the absence of any attack or defense, BERT (a word-piece model) performs the best ( $90.3\\%$ ) followed by word+char models ( $80.5\\%$ ), word-only models ( $79.2\\%$ ) and then char-only models ( $70.3\\%$ ). However, even single-character attacks (chosen adversarially) can be catastrophic, resulting in a significantly degraded performance of $46\\%$ , $57\\%$ , $59\\%$ and $33\\%$ , respectively under the `all' setting.\nIntuitively, one might suppose that word-piece and character-level models would be more robust to such attacks given they can make use of the remaining context. However, we find that they are the more susceptible. To see why, note that the word `beautiful' can only be altered in a few ways for word-only models, either leading to an UNK or an existing vocabulary word, whereas, word-piece and character-only models treat each unique character combination differently. This provides more variations that an attacker can exploit. Following similar reasoning, add and key attacks pose a greater threat than swap and drop attacks. The robustness of different models can be ordered as word-only $>$ word+char $>$ char-only $\\sim $ word-piece, and the efficacy of different attacks as add $>$ key $>$ drop $>$ swap.\nNext, we scrutinize the effectiveness of defense methods when faced against adversarially chosen attacks. Clearly from table 3 , DA and Adv are not effective in this case. We observed that despite a low training error, these models were not able to generalize to attacks on newer words at test time. ATD spell corrector is the most effective on keyboard attacks, but performs poorly on other attack types, particularly the add attack strategy.\nThe ScRNN model with pass-through backoff offers better protection, bringing back the adversarial accuracy within $5\\%$ range for the swap attack. It is also effective under other attack classes, and can mitigate the adversarial effect in word-piece models by $21\\%$ , character-only models by $19\\%$ , and in word, and word+char models by over $4.5\\%$ . This suggests that the direct training signal of word error correction is more effective than the indirect signal of sentiment classification available to DA and Adv for model robustness.\nWe observe additional gains by using background models as a backoff alternative, because of its lower word error rate (WER), especially, under the swap and drop attacks. However, these gains do not consistently translate in all other settings, as lower WER is necessary but not sufficient. Besides lower error rate, we find that a solid defense should furnish the attacker the fewest options to attack, i.e. it should have a low sensitivity.\nAs we shall see in section \u00a7 \"Understanding Model Sensitivity\" , the backoff neutral variation has the lowest sensitivity due to mapping UNK predictions to a fixed neutral word. Thus, it results in the highest robustness on most of the attack types for all four model classes.\nTable 4 shows the accuracy of BERT on 200 examples from the dev set of the MRPC paraphrase detection task under various attack and defense settings. We re-trained the ScRNN model variants on the MRPC training set for these experiments. Again, we find that simple 1-2 character attacks can bring down the accuracy of BERT significantly ( $89\\%$ to $31\\%$ ). Word recognition models can provide an effective defense, with both our pass-through and neutral variants recovering most of the accuracy. While the neutral backoff model is effective on 2-char attacks, it hurts performance in the no attack setting, since it incorrectly modifies certain correctly spelled entity names. Since the two variants are already effective, we did not train a background model for this task.\nUnderstanding Model Sensitivity\nTo study model sensitivity, for each sentence, we perturb one randomly-chosen word and replace it with all possible perturbations under a given attack type. The resulting set of perturbed sentences is then fed to the word recognizer (whose sensitivity is to be estimated). As described in equation 12 , we count the number of unique predictions from the output sentences. Two corrections are considered unique if they are mapped differently by the downstream classifier.\nThe neutral backoff variant has the lowest sensitivity (Table 5 ). This is expected, as it returns a fixed neutral word whenever the ScRNN predicts an UNK, therefore reducing the number of unique outputs it predicts. Open vocabulary (i.e. char-only, word+char, word-piece) downstream classifiers consider every unique combination of characters differently, whereas word-only classifiers internally treat all out of vocabulary (OOV) words alike. Hence, for char-only, word+char, and word-piece models, the pass-through version is more sensitive than the background variant, as it passes words as is (and each combination is considered uniquely). However, for word-only models, pass-through is less sensitive as all the OOV character combinations are rendered identical.\nIdeally, a preferred defense is one with low sensitivity and word error rate. In practice, however, we see that a low error rate often comes at the cost of sensitivity. We see this trade-off in Figure 2 , where we plot WER and sensitivity on the two axes, and depict the robustness when using different backoff variants. Generally, sensitivity is the more dominant factor out of the two, as the error rates of the considered variants are reasonably low.\nWe verify if the sentiment (of the reviews) is preserved with char-level attacks. In a human study with 50 attacked (and subsequently misclassified), and 50 unchanged reviews, it was noted that 48 and 49, respectively, preserved the sentiment.\nConclusion\nAs character and word-piece inputs become commonplace in modern NLP pipelines, it is worth highlighting the vulnerability they add. We show that minimally-doctored attacks can bring down accuracy of classifiers to random guessing. We recommend word recognition as a safeguard against this and build upon RNN-based semi-character word recognizers. We discover that when used as a defense mechanism, the most accurate word recognition models are not always the most robust against adversarial attacks. Additionally, we highlight the need to control the sensitivity of these models to achieve high robustness.\nAcknowledgements\nThe authors are grateful to Graham Neubig, Eduard Hovy, Paul Michel, Mansi Gupta, and Antonios Anastasopoulos for suggestions and feedback.", "source": "scientific_qa", "evaluation": "f1"}
{"instructions": ["What baseline model is used?", "Which additional latent variables are used in the model?", "Which additional latent variables are used in the model?", "Which parallel corpora are used?", "Overall, does having parallel data improve semantic role induction across multiple languages?", "Do they add one latent variable for each language pair in their Bayesian model?", "What does an individual model consist of?", "Do they improve on state-of-the-art semantic role induction?"], "outputs": ["same baseline as used by lang2011unsupervised", "CLV as a parent of the two corresponding role variables", "crosslingual latent variables", "English (EN) and German (DE) sections of the CoNLL 2009 corpus BIBREF13, EN-DE section of the Europarl corpus BIBREF14", "No", "Yes", "Bayesian model of garg2012unsupervised as our base monolingual model", "Yes"], "input": "Introduction\nSemantic Role Labeling (SRL) has emerged as an important task in Natural Language Processing (NLP) due to its applicability in information extraction, question answering, and other NLP tasks. SRL is the problem of finding predicate-argument structure in a sentence, as illustrated below:\nINLINEFORM0\nHere, the predicate WRITE has two arguments: `Mike' as A0 or the writer, and `a book' as A1 or the thing written. The labels A0 and A1 correspond to the PropBank annotations BIBREF0 .\nAs the need for SRL arises in different domains and languages, the existing manually annotated corpora become insufficient to build supervised systems. This has motivated work on unsupervised SRL BIBREF1 , BIBREF2 , BIBREF3 . Previous work has indicated that unsupervised systems could benefit from the word alignment information in parallel text in two or more languages BIBREF4 , BIBREF5 , BIBREF6 . For example, consider the German translation of sentence INLINEFORM0 :\nINLINEFORM0\nIf sentences INLINEFORM0 and INLINEFORM1 have the word alignments: Mike-Mike, written-geschrieben, and book-Buch, the system might be able to predict A1 for Buch, even if there is insufficient information in the monolingual German data to learn this assignment. Thus, in languages where the resources are sparse or not good enough, or the distributions are not informative, SRL systems could be made more accurate by using parallel data with resource rich or more amenable languages.\nIn this paper, we propose a joint Bayesian model for unsupervised semantic role induction in multiple languages. The model consists of individual Bayesian models for each language BIBREF3 , and crosslingual latent variables to incorporate soft role agreement between aligned constituents. This latent variable approach has been demonstrated to increase the performance in a multilingual unsupervised part-of-speech tagging model based on HMMs BIBREF4 . We investigate the application of this approach to unsupervised SRL, presenting the performance improvements obtained in different settings involving labeled and unlabeled data, and analyzing the annotation effort required to obtain similar gains using labeled data.\nWe begin by briefly describing the unsupervised SRL pipeline and the monolingual semantic role induction model we use, and then describe our multilingual model.\nUnsupervised SRL Pipeline\nAs established in previous work BIBREF7 , BIBREF8 , we use a standard unsupervised SRL setup, consisting of the following steps:\nThe task we model, unsupervised semantic role induction, is the step 4 of this pipeline.\nMonolingual Model\nWe use the Bayesian model of garg2012unsupervised as our base monolingual model. The semantic roles are predicate-specific. To model the role ordering and repetition preferences, the role inventory for each predicate is divided into Primary and Secondary roles as follows:\nFor example, the complete role sequence in a frame could be: INLINEFORM0 INLINEFORM1 , INLINEFORM2 , INLINEFORM3 , INLINEFORM4 , INLINEFORM5 , INLINEFORM6 , INLINEFORM7 , INLINEFORM8 INLINEFORM9 . The ordering is defined as the sequence of PRs, INLINEFORM10 INLINEFORM11 , INLINEFORM12 , INLINEFORM13 , INLINEFORM14 , INLINEFORM15 INLINEFORM16 . Each pair of consecutive PRs in an ordering is called an interval. Thus, INLINEFORM17 is an interval that contains two SRs, INLINEFORM18 and INLINEFORM19 . An interval could also be empty, for instance INLINEFORM20 contains no SRs. When we evaluate, these roles get mapped to gold roles. For instance, the PR INLINEFORM21 could get mapped to a core role like INLINEFORM22 , INLINEFORM23 , etc. or to a modifier role like INLINEFORM24 , INLINEFORM25 , etc. garg2012unsupervised reported that, in practice, PRs mostly get mapped to core roles and SRs to modifier roles, which conforms to the linguistic motivations for this distinction.\nFigure FIGREF16 illustrates two copies of the monolingual model, on either side of the crosslingual latent variables. The generative process is as follows:\nAll the multinomial and binomial distributions have symmetric Dirichlet and beta priors respectively. Figure FIGREF7 gives the probability equations for the monolingual model. This formulation models the global role ordering and repetition preferences using PRs, and limited context for SRs using intervals. Ordering and repetition information was found to be helpful in supervised SRL as well BIBREF9 , BIBREF8 , BIBREF10 . More details, including the motivations behind this model, are in BIBREF3 .\nMultilingual Model\nThe multilingual model uses word alignments between sentences in a parallel corpus to exploit role correspondences across languages. We make copies of the monolingual model for each language and add additional crosslingual latent variables (CLVs) to couple the monolingual models, capturing crosslingual semantic role patterns. Concretely, when training on parallel sentences, whenever the head words of the arguments are aligned, we add a CLV as a parent of the two corresponding role variables. Figure FIGREF16 illustrates this model. The generative process, as explained below, remains the same as the monolingual model for the most part, with the exception of aligned roles which are now generated by both the monolingual process as well as the CLV.\nEvery predicate-tuple has its own inventory of CLVs specific to that tuple. Each CLV INLINEFORM0 is a multi-valued variable where each value defines a distribution over role labels for each language (denoted by INLINEFORM1 above). These distributions over labels are trained to be peaky, so that each value INLINEFORM2 for a CLV represents a correlation between the labels that INLINEFORM3 predicts in the two languages. For example, a value INLINEFORM4 for the CLV INLINEFORM5 might give high probabilities to INLINEFORM6 and INLINEFORM7 in language 1, and to INLINEFORM8 in language 2. If INLINEFORM9 is the only value for INLINEFORM10 that gives high probability to INLINEFORM11 in language 1, and the monolingual model in language 1 decides to assign INLINEFORM12 to the role for INLINEFORM13 , then INLINEFORM14 will predict INLINEFORM15 in language 2, with high probability. We generate the CLVs via a Chinese Restaurant Process BIBREF11 , a non-parametric Bayesian model, which allows us to induce the number of CLVs for every predicate-tuple from the data. We continue to train on the non-parallel sentences using the respective monolingual models.\nThe multilingual model is deficient, since the aligned roles are being generated twice. Ideally, we would like to add the CLV as additional conditioning variables in the monolingual models. The new joint probability can be written as equation UID11 (Figure FIGREF7 ), which can be further decomposed following the decomposition of the monolingual model in Figure FIGREF7 . However, having this additional conditioning variable breaks the Dirichlet-multinomial conjugacy, which makes it intractable to marginalize out the parameters during inference. Hence, we use an approximation where we treat each of the aligned roles as being generated twice, once by the monolingual model and once by the corresponding CLV (equation ).\nThis is the first work to incorporate the coupling of aligned arguments directly in a Bayesian SRL model. This makes it easier to see how to extend this model in a principled way to incorporate additional sources of information. First, the model scales gracefully to more than two languages. If there are a total of INLINEFORM0 languages, and there is an aligned argument in INLINEFORM1 of them, the multilingual latent variable is connected to only those INLINEFORM2 aligned arguments.\nSecond, having one joint Bayesian model allows us to use the same model in various semi-supervised learning settings, just by fixing the annotated variables during training. Section SECREF29 evaluates a setting where we have some labeled data in one language (called source), while no labeled data in the second language (called target). Note that this is different from a classic annotation projection setting (e.g. BIBREF12 ), where the role labels are mapped from source constituents to aligned target constituents.\nInference and Training\nThe inference problem consists of predicting the role labels and CLVs (the hidden variables) given the predicate, its voice, and syntactic features of all the identified arguments (the visible variables). We use a collapsed Gibbs-sampling based approach to generate samples for the hidden variables (model parameters are integrated out). The sample counts and the priors are then used to calculate the MAP estimate of the model parameters.\nFor the monolingual model, the role at a given position is sampled as:\nDISPLAYFORM0\nwhere the subscript INLINEFORM0 refers to all the variables except at position INLINEFORM1 , INLINEFORM2 refers to the variables in all the training instances except the current one, and INLINEFORM3 refers to all the model parameters. The above integral has a closed form solution due to Dirichlet-multinomial conjugacy.\nFor sampling roles in the multilingual model, we also need to consider the probabilities of roles being generated by the CLVs:\nDISPLAYFORM0\nFor sampling CLVs, we need to consider three factors: two corresponding to probabilities of generating the aligned roles, and the third one corresponding to selecting the CLV according to CRP.\nDISPLAYFORM0\nwhere the aligned roles INLINEFORM0 and INLINEFORM1 are connected to INLINEFORM2 , and INLINEFORM3 refers to all the variables except INLINEFORM4 , INLINEFORM5 , and INLINEFORM6 .\nWe use the trained parameters to parse the monolingual data using the monolingual model. The crosslingual parameters are ignored even if they were used during training. Thus, the information coming from the CLVs acts as a regularizer for the monolingual models.\nEvaluation\nFollowing the setting of titovcrosslingual, we evaluate only on the arguments that were correctly identified, as the incorrectly identified arguments do not have any gold semantic labels. Evaluation is done using the metric proposed by lang2011unsupervised, which has 3 components: (i) Purity (PU) measures how well an induced cluster corresponds to a single gold role, (ii) Collocation (CO) measures how well a gold role corresponds to a single induced cluster, and (iii) F1 is the harmonic mean of PU and CO. For each predicate, let INLINEFORM0 denote the total number of argument instances, INLINEFORM1 the instances in the induced cluster INLINEFORM2 , and INLINEFORM3 the instances having label INLINEFORM4 in gold annotations. INLINEFORM5 , INLINEFORM6 , and INLINEFORM7 . The score for each predicate is weighted by the number of its argument instances, and a weighted average is computed over all the predicates.\nBaseline\nWe use the same baseline as used by lang2011unsupervised which has been shown to be difficult to outperform. This baseline assigns a semantic role to a constituent based on its syntactic function, i.e. the dependency relation to its head. If there is a total of INLINEFORM0 clusters, INLINEFORM1 most frequent syntactic functions get a cluster each, and the rest are assigned to the INLINEFORM2 th cluster.\nClosest Previous Work\nThis work is closely related to the cross-lingual unsupervised SRL work of titovcrosslingual. Their model has separate monolingual models for each language and an extra penalty term which tries to maximize INLINEFORM0 and INLINEFORM1 i.e. for all the aligned arguments with role label INLINEFORM2 in language 1, it tries to find a role label INLINEFORM3 in language 2 such that the given proportion is maximized and vice verse. However, there is no efficient way to optimize the objective with this penalty term and the authors used an inference method similar to annotation projection. Further, the method does not scale naturally to more than two languages. Their algorithm first does monolingual inference in one language ignoring the penalty and then does the inference in the second language taking into account the penalty term. In contrast, our model adds the latent variables as a part of the model itself, and not an external penalty, which enables us to use the standard Bayesian learning methods such as sampling.\nThe monolingual model we use BIBREF3 also has two main advantages over titovcrosslingual. First, the former incorporates a global role ordering probability that is missing in the latter. Secondly, the latter defines argument-keys as a tuple of four syntactic features and all the arguments having the same argument-keys are assigned the same role. This kind of hard clustering is avoided in the former model where two constituents having the same set of features might get assigned different roles if they appear in different contexts.\nData\nFollowing titovcrosslingual, we run our experiments on the English (EN) and German (DE) sections of the CoNLL 2009 corpus BIBREF13 , and EN-DE section of the Europarl corpus BIBREF14 . We get about 40k EN and 36k DE sentences from the CoNLL 2009 training set, and about 1.5M parallel EN-DE sentences from Europarl. For appropriate comparison, we keep the same setting as in BIBREF6 for automatic parses and argument identification, which we briefly describe here. The EN sentences are parsed syntactically using MaltParser BIBREF15 and DE using LTH parser BIBREF16 . All the non-auxiliary verbs are selected as predicates. In CoNLL data, this gives us about 3k EN and 500 DE predicates. The total number of predicate instances are 3.4M in EN (89k CoNLL + 3.3M Europarl) and 2.62M in DE (17k CoNLL + 2.6M Europarl). The arguments for EN are identified using the heuristics proposed by lang2011unsupervised. However, we get an F1 score of 85.1% for argument identification on CoNLL 2009 EN data as opposed to 80.7% reported by titovcrosslingual. This could be due to implementation differences, which unfortunately makes our EN results incomparable. For DE, the arguments are identified using the LTH system BIBREF16 , which gives an F1 score of 86.5% on the CoNLL 2009 DE data. The word alignments for the EN-DE parallel Europarl corpus are computed using GIZA++ BIBREF17 . For high-precision, only the intersecting alignments in the two directions are kept. We define two semantic arguments as aligned if their head-words are aligned. In total we get 9.3M arguments for EN (240k CoNLL + 9.1M Europarl) and 4.43M for DE (32k CoNLL + 4.4M Europarl). Out of these, 0.76M arguments are aligned.\nMain Results\nSince the CoNLL annotations have 21 semantic roles in total, we use 21 roles in our model as well as the baseline. Following garg2012unsupervised, we set the number of PRs to 2 (excluding INLINEFORM0 , INLINEFORM1 and INLINEFORM2 ), and SRs to 21-2=19. Table TABREF27 shows the results.\nIn the first setting (Line 1), we train and test the monolingual model on the CoNLL data. We observe significant improvements in F1 score over the Baseline (Line 0) in both languages. Using the CoNLL 2009 dataset alone, titovcrosslingual report an F1 score of 80.9% (PU=86.8%, CO=75.7%) for German. Thus, our monolingual model outperforms their monolingual model in German. For English, they report an F1 score of 83.6% (PU=87.5%, CO=80.1%), but note that our English results are not directly comparable to theirs due to differences argument identification, as discussed in section SECREF25 . As their argument identification score is lower, perhaps their system is discarding \u201cdifficult\u201d arguments which leads to a higher clustering score.\nIn the second setting (Line 2), we use the additional monolingual Europarl (EP) data for training. We get equivalent results in English and a significant improvement in German compared to our previous setting (Line 1). The German dataset in CoNLL is quite small and benefits from the additional EP training data. In contrast, the English model is already quite good due to a relatively big dataset from CoNLL, and good accuracy syntactic parsers. Unfortunately, titovcrosslingual do not report results with this setting.\nThe third setting (Line 3) gives the results of our multilingual model, which adds the word alignments in the EP data. Comparing with Line 2, we get non-significant improvements in both languages. titovcrosslingual obtain an F1 score of 82.7% (PU=85.0%, CO=80.6%) for German, and 83.7% (PU=86.8%, CO=80.7%) for English. Thus, for German, our multilingual Bayesian model is able to capture the cross-lingual patterns at least as well as the external penalty term in BIBREF6 . We cannot compare the English results unfortunately due to differences in argument identification.\nWe also compared monolingual and bilingual training data using a setting that emulates the standard supervised setup of separate training and test data sets. We train only on the EP dataset and test on the CoNLL dataset. Lines 4 and 5 of Table TABREF27 give the results. The multilingual model obtains small improvements in both languages, which confirms the results from the standard unsupervised setup, comparing lines 2 to 3.\nThese results indicate that little information can be learned about semantic roles from this parallel data setup. One possible explanation for this result is that the setup itself is inadequate. Given the definition of aligned arguments, only 8% of English arguments and 17% of German arguments are aligned. This plus our experiments suggest that improving the alignment model is a necessary step to making effective use of parallel data in multilingual SRI, for example by joint modeling with SRI. We leave this exploration to future work.\nMultilingual Training with Labeled Data for One Language\nAnother motivation for jointly modeling SRL in multiple languages is the transfer of information from a resource rich language to a resource poor language. We evaluated our model in a very general annotation transfer scenario, where we have a small labeled dataset for one language (source), and a large parallel unlabeled dataset for the source and another (target) language. We investigate whether this setting improves the parameter estimates for the target language. To this end, we clamp the role annotations of the source language in the CoNLL dataset using a predefined mapping, and do not sample them during training. This data gives us good parameters for the source language, which are used to sample the roles of the source language in the unlabeled Europarl data. The CLVs aim to capture this improvement and thereby improve sampling and parameter estimates for the target language. Table TABREF28 shows the results of this experiment. We obtain small improvements in the target languages. As in the unsupervised setting, the small percentage of aligned roles probably limits the impact of the cross-lingual information.\nLabeled Data in Monolingual Model\nWe explored the improvement in the monolingual model in a semi-supervised setting. To this end, we randomly selected INLINEFORM0 of the sentences in the CoNLL dataset as \u201csupervised sentences\u201d and the rest INLINEFORM1 were kept unsupervised. Next, we clamped the role labels of the supervised sentences using the predefined mapping from Section SECREF29 . Sampling was done on the unsupervised sentences as usual. We then measured the clustering performance using the trained parameters.\nTo access the contribution of partial supervision better, we constructed a \u201csupervised baseline\u201d as follows. For predicates seen in the supervised sentences, a MAP estimate of the parameters was calculated using the predefined mapping. For the unseen predicates, the standard baseline was used.\nFigures FIGREF33 and FIGREF33 show the performance variation with INLINEFORM0 . We make the following observations:\n[leftmargin=*]\nIn both languages, at around INLINEFORM0 , the supervised baseline starts outperforming the semi-supervised model, which suggests that manually labeling about 10% of the sentences is a good enough alternative to our training procedure. Note that 10% amounts to about 3.6k sentences in German and 4k in English. We noticed that the proportion of seen predicates increases dramatically as we increase the proportion of supervised sentences. At 10% supervised sentences, the model has already seen 63% of predicates in German and 44% in English. This explains to some extent why only 10% labeled sentences are enough.\nFor German, it takes about 3.5% or 1260 supervised sentences to have the same performance increase as 1.5M unlabeled sentences (Line 1 to Line 2 in Table TABREF27 ). Adding about 180 more supervised sentences also covers the benefit obtained by alignments in the multilingual model (Line 2 to Line 3 in Table TABREF27 ). There is no noticeable performance difference in English.\nWe also evaluated the performance variation on a completely unseen CoNLL test set. Since the test set is very small compared to the training set, the clustering evaluation is not as reliable. Nonetheless, we broadly obtained the same pattern.\nRelated Work\nAs discussed in section SECREF24 , our work is closely related to the crosslingual unsupervised SRL work of titovcrosslingual. The idea of using superlingual latent variables to capture cross-lingual information was proposed for POS tagging by naseem2009multilingual, which we use here for SRL. In a semi-supervised setting, pado2009cross used a graph based approach to transfer semantic role annotations from English to German. furstenau2009graph used a graph alignment method to measure the semantic and syntactic similarity between dependency tree arguments of known and unknown verbs.\nFor monolingual unsupervised SRL, swier2004unsupervised presented the first work on a domain-general corpus, the British National Corpus, using 54 verbs taken from VerbNet. garg2012unsupervised proposed a Bayesian model for this problem that we use here. titov2012bayesian also proposed a closely related Bayesian model. grenager2006unsupervised proposed a generative model but their parameter space consisted of all possible linkings of syntactic constituents and semantic roles, which made unsupervised learning difficult and a separate language-specific rule based method had to be used to constrain this space. Other proposed models include an iterative split-merge algorithm BIBREF18 and a graph-partitioning based approach BIBREF1 . marquez2008semantic provide a good overview of the supervised SRL systems.\nConclusions\nWe propose a Bayesian model of semantic role induction (SRI) that uses crosslingual latent variables to capture role alignments in parallel corpora. The crosslingual latent variables capture correlations between roles in different languages, and regularize the parameter estimates of the monolingual models. Because this is a joint Bayesian model of multilingual SRI, we can apply the same model to a variety of training scenarios just by changing the inference procedure appropriately. We evaluate monolingual SRI with a large unlabeled dataset, bilingual SRI with a parallel corpus, bilingual SRI with annotations available for the source language, and monolingual SRI with a small labeled dataset. Increasing the amount of monolingual unlabeled data significantly improves SRI in German but not in English. Adding word alignments in parallel sentences results in small, non significant improvements, even if there is some labeled data available in the source language. This difficulty in showing the usefulness of parallel corpora for SRI may be due to the current assumptions about role alignments, which mean that only a small percentage of roles are aligned. Further analyses reveals that annotating small amounts of data can easily outperform the performance gains obtained by adding large unlabeled dataset as well as adding parallel corpora.\nFuture work includes training on different language pairs, on more than two languages, and with more inclusive models of role alignment.\nAcknowledgments\nThis work was funded by the Swiss NSF grant 200021_125137 and EC FP7 grant PARLANCE.", "source": "scientific_qa", "evaluation": "f1"}
{"instructions": ["how many tags do they look at?", "which algorithm was the highest performer?", "how is diversity measured?", "how large is the vocabulary?", "what dataset was used?", "what algorithms did they use?"], "outputs": ["48,705", "A hybrid model consisting of best performing popularity-based approach with the best similarity-based approach", " the average dissimilarity of all pairs of tags in the list of recommended tags", "33,663", " E-book annotation data: editor tags, Amazon search terms, and  Amazon review keywords.", "popularity-based, similarity-based, hybrid"], "input": "Introduction\nWhen people shop for books online in e-book stores such as, e.g., the Amazon Kindle store, they enter search terms with the goal to find e-books that meet their preferences. Such e-books have a variety of metadata such as, e.g., title, author or keywords, which can be used to retrieve e-books that are relevant to the query. As a consequence, from the perspective of e-book publishers and editors, annotating e-books with tags that best describe the content and which meet the vocabulary of users (e.g., when searching and reviewing e-books) is an essential task BIBREF0 .\nProblem and aim of this work. Annotating e-books with suitable tags is, however, a complex task as users' vocabulary may differ from the one of editors. Such a vocabulary mismatch yet hinders effective organization and retrieval BIBREF1 of e-books. For example, while editors mostly annotate e-books with descriptive tags that reflect the book's content, Amazon users often search for parts of the book title. In the data we use for the present study (see Section SECREF2 ), we find that around 30% of the Amazon search terms contain parts of e-book titles.\nIn this paper, we present our work to support editors in the e-book annotation process with tag recommendations BIBREF2 , BIBREF3 . Our idea is to exploit user-generated search query terms in Amazon to mimic the vocabulary of users in Amazon, who search for e-books. We combine these search terms with tags assigned by editors in a hybrid tag recommendation approach. Thus, our aim is to show that we can improve the performance of tag recommender systems for e-books both concerning recommendation accuracy as well as semantic similarity and tag recommendation diversity.\nRelated work. In tag recommender systems, mostly content-based algorithms (e.g., BIBREF4 , BIBREF5 ) are used to recommend tags to annotate resources such as e-books. In our work, we incorporate both content features of e-books (i.e., title and description text) as well as Amazon search terms to account for the vocabulary of e-book readers.\nConcerning the evaluation of tag recommendation systems, most studies focus on measuring the accuracy of tag recommendations (e.g., BIBREF2 ). However, the authors of BIBREF6 suggest also to use beyond-accuracy metrics such as diversity to evaluate the quality of tag recommendations. In our work, we measure recommendation diversity in addition to recommendation accuracy and propose a novel metric termed semantic similarity to validate semantic matches of tag recommendations.\nApproach and findings. We exploit editor tags and user-generated search terms as input for tag recommendation approaches. Our evaluation comprises of a rich set of 19 different algorithms to recommend tags for e-books, which we group into (i) popularity-based, (ii) similarity-based (i.e., using content information), and (iii) hybrid approaches. We evaluate our approaches in terms of accuracy, semantic similarity and diversity on the review content of Amazon users, which reflects the readers' vocabulary. With semantic similarity, we measure how semantically similar (based on learned Doc2Vec BIBREF7 embeddings) the list of recommended tags is to the list of relevant tags. We use this additional metric to measure not only exact \u201chits\u201d of our recommendations but also semantic matches.\nOur evaluation results show that combining both data sources enhances the quality of tag recommendations for annotating e-books. Furthermore, approaches that solely train on Amazon search terms provide poor performance in terms of accuracy but deliver good results in terms of semantic similarity and recommendation diversity.\nMethod\nIn this section, we describe our dataset as well as our tag recommendation approaches we propose to annotate e-books.\nDataset\nOur dataset contains two sources of data, one to generate tag recommendations and another one to evaluate tag recommendations. HGV GmbH has collected all data sources and we provide the dataset statistics in Table TABREF3 .\nData used to generate recommendations. We employ two sources of e-book annotation data: (i) editor tags, and (ii) Amazon search terms. For editor tags, we collect data of 48,705 e-books from 13 publishers, namely Kunstmann, Delius-Klasnig, VUR, HJR, Diogenes, Campus, Kiwi, Beltz, Chbeck, Rowohlt, Droemer, Fischer and Neopubli. Apart from the editor tags, this data contains metadata fields of e-books such as the ISBN, the title, a description text, the author and a list of BISACs, which are identifiers for book categories.\nFor the Amazon search terms, we collect search query logs of 21,243 e-books for 12 months (i.e., November 2017 to October 2018). Apart from the search terms, this data contains the e-books' ISBNs, titles and description texts.\nTable TABREF3 shows that the overlap of e-books that have editor tags and Amazon search terms is small (i.e., only 497). Furthermore, author and BISAC (i.e., the book category identifier) information are primarily available for e-books that contain editor tags. Consequently, both data sources provide complementary information, which underpins the intention of this work, i.e., to evaluate tag recommendation approaches using annotation sources from different contexts.\nData used to evaluate recommendations. For evaluation, we use a third set of e-book annotations, namely Amazon review keywords. These review keywords are extracted from the Amazon review texts and are typically provided in the review section of books on Amazon. Our idea is to not favor one or the other data source (i.e., editor tags and Amazon search terms) when evaluating our approaches against expected tags. At the same time, we consider Amazon review keywords to be a good mixture of editor tags and search terms as they describe both the content and the users' opinions on the e-books (i.e., the readers' vocabulary). As shown in Table TABREF3 , we collect Amazon review keywords for 2,896 e-books (publishers: Kiwi, Rowohlt, Fischer, and Droemer), which leads to 33,663 distinct review keywords and on average 30 keyword assignments per e-book.\nTag Recommendation Approaches\nWe implement three types of tag recommendation approaches, i.e., (i) popularity-based, (ii) similarity-based (i.e., using content information), and (iii) hybrid approaches. Due to the lack of personalized tags (i.e., we do not know which user has assigned a tag), we do not implement other types of algorithms such as collaborative filtering BIBREF8 . In total, we evaluate 19 different algorithms to recommend tags for annotating e-books.\nPopularity-based approaches. We recommend the most frequently used tags in the dataset, which is a common strategy for tag recommendations BIBREF9 . That is, a most popular INLINEFORM0 approach for editor tags and a most popular INLINEFORM1 approach for Amazon search terms. For e-books, for which we also have author (= INLINEFORM2 and INLINEFORM3 ) or BISAC (= INLINEFORM4 and INLINEFORM5 ) information, we use these features to further filter the recommended tags, i.e., to only recommend tags that were used to annotate e-books of a specific author or a specific BISAC.\nWe combine both data sources (i.e., editor tags and Amazon search terms) using a round-robin combination strategy, which ensures an equal weight for both sources. This gives us three additional popularity-based algorithms (= INLINEFORM0 , INLINEFORM1 and INLINEFORM2 ).\nSimilarity-based approaches. We exploit the textual content of e-books (i.e., description or title) to recommend relevant tags BIBREF10 . For this, we first employ a content-based filtering approach BIBREF11 based on TF-IDF BIBREF12 to find top- INLINEFORM0 similar e-books. For each of the similar e-books, we then either extract the assigned editor tags (= INLINEFORM2 and INLINEFORM3 ) or the Amazon search terms (= INLINEFORM4 and INLINEFORM5 ). To combine the tags of the top- INLINEFORM6 similar e-books, we use the cross-source algorithm BIBREF13 , which favors tags that were used to annotate more than one similar e-book (i.e., tags that come from multiple recommendation sources). The final tag relevancy is calculated as: DISPLAYFORM0\nwhere INLINEFORM0 denotes the number of distinct e-books, which yielded the recommendation of tag INLINEFORM1 , to favor tags that come from multiple sources and INLINEFORM2 is the similarity score of the corresponding e-book. We again use a round-robin strategy to combine both data sources (= INLINEFORM3 and INLINEFORM4 ).\nHybrid approaches. We use the previously mentioned cross-source algorithm BIBREF13 to construct four hybrid recommendation approaches. In this case, tags are favored that are recommended by more than one algorithm.\nHence, to create a popularity-based hybrid (= INLINEFORM0 ), we combine the best three performing popularity-based approaches from the ones (i) without any contextual signal, (ii) with the author as context, and (iii) with BISAC as context. In the case of the similarity-based hybrid (= INLINEFORM1 ), we utilize the two best performing similarity-based approaches from the ones (i) which use the title, and (ii) which use the description text. We further define INLINEFORM2 , a hybrid approach that combines the three popularity-based methods of INLINEFORM3 and the two similarity-based approaches of INLINEFORM4 . Finally, we define INLINEFORM5 as a hybrid approach that uses the best performing popularity-based and the best performing similarity-based approach (see Figure FIGREF11 in Section SECREF4 for more details about the particular algorithm combinations).\nExperimental Setup\nIn this section, we describe our evaluation protocol as well as the measures we use to evaluate and compare our tag recommendation approaches.\nEvaluation Protocol\nFor evaluation, we use the third set of e-book annotations, namely Amazon review keywords. As described in Section SECREF1 , these review keywords are extracted from the Amazon review texts and thus, reflect the users' vocabulary. We evaluate our approaches for the 2,896 e-books, for whom we got review keywords. To follow common practice for tag recommendation evaluation BIBREF14 , we predict the assigned review keywords (= our test set) for respective e-books.\nEvaluation Metrics\nIn this work, we measure (i) recommendation accuracy, (ii) semantic similarity, and (iii) recommendation diversity to evaluate the quality of our approaches from different perspectives.\nRecommendation accuracy. We use Normalized Discounted Cumulative Gain (nDCG) BIBREF15 to measure the accuracy of the tag recommendation approaches. The nDCG measure is a standard ranking-dependent metric that not only measures how many tags can be correctly predicted but also takes into account their position in the recommendation list with length of INLINEFORM0 . It is based on the Discounted Cummulative Gain, which is given by: DISPLAYFORM0\nwhere INLINEFORM0 is a function that returns 1 if the recommended tag at position INLINEFORM1 in the recommended list is relevant. We then calculate DCG@ INLINEFORM2 for every evaluated e-book by dividing DCG@ INLINEFORM3 with the ideal DCG value iDCG@ INLINEFORM4 , which is the highest possible DCG value that can be achieved if all the relevant tags would be recommended in the correct order. It is given by the following formula BIBREF15 : DISPLAYFORM0\nSemantic similarity. One precondition of standard recommendation accuracy measures is that to generate a \u201chit\u201d, the recommended tag needs to be an exact syntactical match to the one from the test set. When tags are recommended from one data source and compared to tags from another source, this can be problematic. For example, if we recommend the tag \u201cvictim\u201d but expect the tag \u201cprey\u201d, we would mark this as a mismatch, therefore being a bad recommendation. But if we know that the corresponding e-book is a crime novel, the recommended tag would be (semantically) descriptive to reflect the book's content. Hence, in this paper, we propose to additionally measure the semantic similarity between recommended tags and tags from the test set (i.e., the Amazon review keywords).\nOver the last four years, there have been several notable publications in the area of applying deep learning to uncover semantic relationships between textual content (e.g., by learning word embeddings with Word2Vec BIBREF16 , BIBREF17 ). Based on this, we propose an alternative measure of recommendation quality by learning the semantic relationships from both vocabularies and then using it to compare how semantically similar the recommended tags are to the expected review keywords. For this, we first extract the textual content in the form of the description text, title, editor tags and Amazon search terms of e-books from our dataset. We then train a Doc2Vec BIBREF7 model on the content. Then, we use the model to infer the latent representation for both the complete list of recommended tags as well as the list of expected tags from the test set. Finally, we use the cosine similarity measure to calculate how semantically similar these two lists are.\nRecommendation diversity. As defined in BIBREF18 , we calculate recommendation diversity as the average dissimilarity of all pairs of tags in the list of recommended tags. Thus, given a distance function INLINEFORM0 that corresponds to the dissimilarity between two tags INLINEFORM1 and INLINEFORM2 in the list of recommended tags, INLINEFORM3 is given as the average dissimilarity of all pairs of tags: DISPLAYFORM0\nwhere INLINEFORM0 is the number of evaluated e-books and the dissimilarity function is defined as INLINEFORM1 . In our experiments, we use the previously trained Doc2Vec model to extract the latent representation of a specific tag. The similarity of two tags INLINEFORM2 is then calculated with the Cosine similarity measure using the latent vector representations of respective tags INLINEFORM3 and INLINEFORM4 .\nResults\nConcerning tag recommendation accuracy, in this section, we report results for different values of INLINEFORM0 (i.e., number of recommended tags). For the beyond-accuracy experiment, we use the full list of recommended tags (i.e., INLINEFORM1 ).\nRecommendation Accuracy Evaluation\nFigure FIGREF11 shows the results of the accuracy experiment for the (i) popularity-based, (ii) similarity-based, and (iii) hybrid tag recommendation approaches.\nPopularity-based approaches. In Figure FIGREF11 , we see that popularity-based approaches based on editor tags tend to perform better than if trained on Amazon search terms. If we take into account contextual information like BISAC or author, we can further improve accuracy in terms of INLINEFORM0 . That is, we find that using popular tags from e-books of a specific author leads to the best accuracy of the popularity-based approaches. This suggests that editors and readers do seem to reuse tags for e-books of same authors. If we use both editor tags and Amazon search terms, we can further increase accuracy, especially for higher values of INLINEFORM1 like in the case of INLINEFORM2 . This is, however, not the case for INLINEFORM3 as the accuracy of the integrated INLINEFORM4 approach is low. The reason for this is the limited amount of e-books from within the Amazon search query logs that have BISAC information (i.e., only INLINEFORM5 ).\nSimilarity-based approaches. We further improve accuracy if we first find similar e-books and then extract their top- INLINEFORM0 tags in a cross-source manner as described in Section SECREF4 .\nAs shown in Figure FIGREF11 , using the description text to find similar e-books results in more accurate tag recommendations than using the title (i.e., INLINEFORM0 for INLINEFORM1 ). This is somehow expected as the description text consists of a bigger corpus of words (i.e., multiple sentences) than the title. Concerning the collected Amazon search query logs, extracting and then recommending tags from this source results in a much lower accuracy performance. Thus, these results also suggest to investigate beyond-accuracy metrics as done in Section SECREF17 .\nHybrid approaches. Figure FIGREF11 shows the accuracy results of the four hybrid approaches. By combining the best three popularity-based approaches, we outperform all of the initially evaluated popularity algorithms (i.e., INLINEFORM0 for INLINEFORM1 ). On the contrary, the combination of the two best performing similarity-based approaches INLINEFORM2 and INLINEFORM3 does not yield better accuracy. The negative impact of using a lower-performing approach such as INLINEFORM4 within a hybrid combination can also be observed in INLINEFORM5 for lower values of INLINEFORM6 . Overall, this confirms our initial intuition that combining the best performing popularity-based approach with the best similarity-based approach should result in the highest accuracy (i.e., INLINEFORM7 for INLINEFORM8 ). Moreover, our goal, namely to exploit editor tags in combination with search terms used by readers to increase the metadata quality of e-books, is shown to be best supported by applying hybrid approaches as they provide the best prediction results.\nBeyond-Accuracy Evaluation\nFigure FIGREF16 illustrates the results of the experiments, which measure the recommendation impact beyond-accuracy.\nSemantic similarity. Figure FIGREF16 illustrates the results of our proposed semantic similarity measure. To compare our proposed measure to standard accuracy measures such as INLINEFORM0 , we use Kendall's Tau rank correlation BIBREF19 as suggested by BIBREF20 for automatic evaluation of information-ordering tasks. From that, we rank our recommendation approaches according to both accuracy and semantic similarity and calculate the relation between both rankings. This results in INLINEFORM1 with a p-value < INLINEFORM2 , which suggests a high correlation between the semantic similarity and the standard accuracy measure.\nTherefore, the semantic similarity measure helps us interpret the recommendation quality. For instance, we achieve the lowest INLINEFORM0 values with the similarity-based approaches that recommend Amazon search terms (i.e., INLINEFORM1 and INLINEFORM2 ). When comparing these results with others from Figure FIGREF11 , a conclusion could be quickly drawn that the recommended tags are merely unusable. However, by looking at Figure FIGREF16 , we see that, although these approaches do not provide the highest recommendation accuracy, they still result in tag recommendations that are semantically related at a high degree to the expected annotations from the test set. Overall, this suggests that approaches, which provide a poor accuracy performance concerning INLINEFORM4 but provide a good performance regarding semantic similarity could still be helpful for annotating e-books.\nRecommendation diversity. Figure FIGREF16 shows the diversity of the tag recommendation approaches. We achieve the highest diversity with the similarity-based approaches, which extract Amazon search terms. Their accuracy is, however, very low. Thus, the combination of the two vocabularies can provide a good trade-off between recommendation accuracy and diversity.\nConclusion and Future Work\nIn this paper, we present our work to support editors in the e-book annotation process. Specifically, we aim to provide tag recommendations that incorporate both the vocabulary of the editors and e-book readers. Therefore, we train various configurations of tag recommender approaches on editors' tags and Amazon search terms and evaluate them on a dataset containing Amazon review keywords. We find that combining both data sources enhances the quality of tag recommendations for annotating e-books. Furthermore, while approaches that train only on Amazon search terms provide poor performance concerning recommendation accuracy, we show that they still offer helpful annotations concerning recommendation diversity as well as our novel semantic similarity metric.\nFuture work. For future work, we plan to validate our findings using another dataset, e.g., by recommending tags for scientific articles and books in BibSonomy. With this, we aim to demonstrate the usefulness of the proposed approach in a similar domain and to enhance the reproducibility of our results by using an open dataset.\nMoreover, we plan to evaluate our tag recommendation approaches in a study with domain users. Also, we want to improve our similarity-based approaches by integrating novel embedding approaches BIBREF16 , BIBREF17 as we did, for example, with our proposed semantic similarity evaluation metric. Finally, we aim to incorporate explanations for recommended tags so that editors of e-book annotations receive additional support in annotating e-books BIBREF21 . By making the underlying (semantic) reasoning visible to the editor who is in charge of tailoring annotations, we aim to support two goals: (i) allowing readers to discover e-books more efficiently, and (ii) enabling publishers to leverage semi-automatic categorization processes for e-books. In turn, providing explanations fosters control over which vocabulary to choose when tagging e-books for different application contexts.\nAcknowledgments. The authors would like to thank Peter Langs, Jan-Philipp Wolf and Alyona Schraa from HGV GmbH for providing the e-book annotation data. This work was funded by the Know-Center GmbH (FFG COMET Program), the FFG Data Market Austria project and the AI4EU project (EU grant 825619). The Know-Center GmbH is funded within the Austrian COMET Program - Competence Centers for Excellent Technologies - under the auspices of the Austrian Ministry of Transport, Innovation and Technology, the Austrian Ministry of Economics and Labor and by the State of Styria. COMET is managed by the Austrian Research Promotion Agency (FFG).", "source": "scientific_qa", "evaluation": "f1"}
{"instructions": ["What baseline method is used?", "What details are given about the Twitter dataset?", "What details are given about the movie domain dataset?", "Which hand-crafted features are combined with word2vec?", "What word-based and dictionary-based feature are used?", "How are the supervised scores of the words calculated?"], "outputs": ["use the word2vec algorithm, create several unsupervised hand-crafted features, generate document vectors and feed them as input into the support vector machines (SVM) approach", "one of the Twitter datasets is about Turkish mobile network operators, there are positive, neutral and negative labels and provide the total amount plus the distribution of labels", "there are 20,244 reviews divided into positive and negative with an average 39 words per review, each one having a star-rating score", "polarity scores, which are minimum, mean, and maximum polarity scores, from each review", "generate word embeddings specific to a domain, TDK (T\u00fcrk Dil Kurumu - \u201cTurkish Language Institution\u201d) dictionary to obtain word polarities", "(+1 or -1), words of opposite polarities (e.g. \u201chappy\" and \u201cunhappy\") get far away from each other"], "input": "Introduction\nSentiment analysis has recently been one of the hottest topics in natural language processing (NLP). It is used to identify and categorise opinions expressed by reviewers on a topic or an entity. Sentiment analysis can be leveraged in marketing, social media analysis, and customer service. Although many studies have been conducted for sentiment analysis in widely spoken languages, this topic is still immature for Turkish and many other languages.\nNeural networks outperform the conventional machine learning algorithms in most classification tasks, including sentiment analysis BIBREF0. In these networks, word embedding vectors are fed as input to overcome the data sparsity problem and make the representations of words more \u201cmeaningful\u201d and robust. Those embeddings indicate how close the words are to each other in the vector space model (VSM).\nMost of the studies utilise embeddings, such as word2vec BIBREF1, which take into account the syntactic and semantic representations of the words only. Discarding the sentimental aspects of words may lead to words of different polarities being close to each other in the VSM, if they share similar semantic and syntactic features.\nFor Turkish, there are only a few studies which leverage sentimental information in generating the word and document embeddings. Unlike the studies conducted for English and other widely-spoken languages, in this paper, we use the official dictionaries for this language and combine the unsupervised and supervised scores to generate a unified score for each dimension of the word embeddings in this task.\nOur main contribution is to create original and effective word vectors that capture syntactic, semantic, and sentimental characteristics of words, and use all of this knowledge in generating embeddings. We also utilise the word2vec embeddings trained on a large corpus. Besides using these word embeddings, we also generate hand-crafted features on a review basis and create document vectors. We evaluate those embeddings on two datasets. The results show that we outperform the approaches which do not take into account the sentimental information. We also had better performances than other studies carried out on sentiment analysis in Turkish media. We also evaluated our novel embedding approaches on two English corpora of different genres. We outperformed the baseline approaches for this language as well. The source code and datasets are publicly available.\nThe paper is organised as follows. In Section 2, we present the existing works on sentiment classification. In Section 3, we describe the methods proposed in this work. The experimental results are shown and the main contributions of our proposed approach are discussed in Section 4. In Section 5, we conclude the paper.\nRelated Work\nIn the literature, the main consensus is that the use of dense word embeddings outperform the sparse embeddings in many tasks. Latent semantic analysis (LSA) used to be the most popular method in generating word embeddings before the invention of the word2vec and other word vector algorithms which are mostly created by shallow neural network models. Although many studies have been employed on generating word vectors including both semantic and sentimental components, generating and analysing the effects of different types of embeddings on different tasks is an emerging field for Turkish.\nLatent Dirichlet allocation (LDA) is used in BIBREF2 to extract mixture of latent topics. However, it focusses on finding the latent topics of a document, not the word meanings themselves. In BIBREF3, LSA is utilised to generate word vectors, leveraging indirect co-occurrence statistics. These outperform the use of sparse vectors BIBREF4.\nSome of the prior studies have also taken into account the sentimental characteristics of a word when creating word vectors BIBREF5, BIBREF6, BIBREF7. A model with semantic and sentiment components is built in BIBREF8, making use of star-ratings of reviews. In BIBREF9, a sentiment lexicon is induced preferring the use of domain-specific co-occurrence statistics over the word2vec method and they outperform the latter.\nIn a recent work on sentiment analysis in Turkish BIBREF10, they learn embeddings using Turkish social media. They use the word2vec algorithm, create several unsupervised hand-crafted features, generate document vectors and feed them as input into the support vector machines (SVM) approach. We outperform this baseline approach using more effective word embeddings and supervised hand-crafted features.\nIn English, much of the recent work on learning sentiment-specific embeddings relies only on distant supervision. In BIBREF11, emojis are used as features and a bi-LSTM (bidirectional long short-term memory) neural network model is built to learn sentiment-aware word embeddings. In BIBREF12, a neural network that learns word embeddings is built by using contextual information about the data and supervised scores of the words. This work captures the supervised information by utilising emoticons as features. Most of our approaches do not rely on a neural network model in learning embeddings. However, they produce state-of-the-art results.\nMethodology\nWe generate several word vectors, which capture the sentimental, lexical, and contextual characteristics of words. In addition to these mostly original vectors, we also create word2vec embeddings to represent the corpus words by training the embedding model on these datasets. After generating these, we combine them with hand-crafted features to create document vectors and perform classification, as will be explained in Section 3.5.\nMethodology ::: Corpus-based Approach\nContextual information is informative in the sense that, in general, similar words tend to appear in the same contexts. For example, the word smart is more likely to cooccur with the word hardworking than with the word lazy. This similarity can be defined semantically and sentimentally. In the corpus-based approach, we capture both of these characteristics and generate word embeddings specific to a domain.\nFirstly, we construct a matrix whose entries correspond to the number of cooccurrences of the row and column words in sliding windows. Diagonal entries are assigned the number of sliding windows that the corresponding row word appears in the whole corpus. We then normalise each row by dividing entries in the row by the maximum score in it.\nSecondly, we perform the principal component analysis (PCA) method to reduce the dimensionality. It captures latent meanings and takes into account high-order cooccurrence removing noise. The attribute (column) number of the matrix is reduced to 200. We then compute cosine similarity between each row pair $w_i$ and $w_j$ as in (DISPLAY_FORM3) to find out how similar two word vectors (rows) are.\nThirdly, all the values in the matrix are subtracted from 1 to create a dissimilarity matrix. Then, we feed the matrix as input into the fuzzy c-means clustering algorithm. We chose the number of clusters as 200, as it is considered a standard for word embeddings in the literature. After clustering, the dimension i for a corresponding word indicates the degree to which this word belongs to cluster i. The intuition behind this idea is that if two words are similar in the VSM, they are more likely to belong to the same clusters with akin probabilities. In the end, each word in the corpus is represented by a 200-dimensional vector.\nIn addition to this method, we also perform singular value decomposition (SVD) on the cooccurrence matrices, where we compute the matrix $M^{PPMI} = U\\Sigma V^{T}$. Positive pointwise mutual information (PPMI) scores between words are calculated and the truncated singular value decomposition is computed. We take into account the U matrix only for each word. We have chosen the singular value number as 200. That is, each word in the corpus is represented by a 200-dimensional vector as follows.\nMethodology ::: Dictionary-based Approach\nIn Turkish, there do not exist well-established sentiment lexicons as in English. In this approach, we made use of the TDK (T\u00fcrk Dil Kurumu - \u201cTurkish Language Institution\u201d) dictionary to obtain word polarities. Although it is not a sentiment lexicon, combining it with domain-specific polarity scores obtained from the corpus led us to have state-of-the-art results.\nWe first construct a matrix whose row entries are corpus words and column entries are the words in their dictionary definitions. We followed the boolean approach. For instance, for the word cat, the column words occurring in its dictionary definition are given a score of 1. Those column words not appearing in the definition of cat are assigned a score of 0 for that corresponding row entry.\nWhen we performed clustering on this matrix, we observed that those words having similar meanings are, in general, assigned to the same clusters. However, this similarity fails in capturing the sentimental characteristics. For instance, the words happy and unhappy are assigned to the same cluster, since they have the same words, such as feeling, in their dictionary definitions. However, they are of opposite polarities and should be discerned from each other.\nTherefore, we utilise a metric to move such words away from each other in the VSM, even though they have common words in their dictionary definitions. We multiply each value in a row with the corresponding row word's raw supervised score, thereby having more meaningful clusters. Using the training data only, the supervised polarity score per word is calculated as in (DISPLAY_FORM4).\nHere, $ w_{t}$ denotes the sentiment score of word $t$, $N_{t}$ is the number of documents (reviews or tweets) in which $t$ occurs in the dataset of positive polarity, $N$ is the number of all the words in the corpus of positive polarity. $N^{\\prime }$ denotes the corpus of negative polarity. $N^{\\prime }_{t}$ and $N^{\\prime }$ denote similar values for the negative polarity corpus. We perform normalisation to prevent the imbalance problem and add a small number to both numerator and denominator for smoothing.\nAs an alternative to multiplying with the supervised polarity scores, we also separately multiplied all the row scores with only +1 if the row word is a positive word, and with -1 if it is a negative word. We have observed it boosts the performance more compared to using raw scores.\nThe effect of this multiplication is exemplified in Figure FIGREF7, showing the positions of word vectors in the VSM. Those \u201cx\" words are sentimentally negative words, those \u201co\" words are sentimentally positive ones. On the top coordinate plane, the words of opposite polarities are found to be close to each other, since they have common words in their dictionary definitions. Only the information concerned with the dictionary definitions are used there, discarding the polarity scores. However, when we utilise the supervised score (+1 or -1), words of opposite polarities (e.g. \u201chappy\" and \u201cunhappy\") get far away from each other as they are translated across coordinate regions. Positive words now appear in quadrant 1, whereas negative words appear in quadrant 3. Thus, in the VSM, words that are sentimentally similar to each other could be clustered more accurately. Besides clustering, we also employed the SVD method to perform dimensionality reduction on the unsupervised dictionary algorithm and used the newly generated matrix by combining it with other subapproaches. The number of dimensions is chosen as 200 again according to the $U$ matrix. The details are given in Section 3.4. When using and evaluating this subapproach on the English corpora, we used the SentiWordNet lexicon BIBREF13. We have achieved better results for the dictionary-based algorithm when we employed the SVD reduction method compared to the use of clustering.\nMethodology ::: Supervised Contextual 4-scores\nOur last component is a simple metric that uses four supervised scores for each word in the corpus. We extract these scores as follows. For a target word in the corpus, we scan through all of its contexts. In addition to the target word's polarity score (the self score), out of all the polarity scores of words occurring in the same contexts as the target word, minimum, maximum, and average scores are taken into consideration. The word polarity scores are computed using (DISPLAY_FORM4). Here, we obtain those scores from the training data.\nThe intuition behind this method is that those four scores are more indicative of a word's polarity rather than only one (the self score). This approach is fully supervised unlike the previous two approaches.\nMethodology ::: Combination of the Word Embeddings\nIn addition to using the three approaches independently, we also combined all the matrices generated in the previous approaches. That is, we concatenate the reduced forms (SVD - U) of corpus-based, dictionary-based, and the whole of 4-score vectors of each word, horizontally. Accordingly, each corpus word is represented by a 404-dimensional vector, since corpus-based and dictionary-based vector components are each composed of 200 dimensions, whereas the 4-score vector component is formed by four values.\nThe main intuition behind the ensemble method is that some approaches compensate for what the others may lack. For example, the corpus-based approach captures the domain-specific, semantic, and syntactic characteristics. On the other hand, the 4-scores method captures supervised features, and the dictionary-based approach is helpful in capturing the general semantic characteristics. That is, combining those three approaches makes word vectors more representative.\nMethodology ::: Generating Document Vectors\nAfter creating several embeddings as mentioned above, we create document (review or tweet) vectors. For each document, we sum all the vectors of words occurring in that document and take their average. In addition to it, we extract three hand-crafted polarity scores, which are minimum, mean, and maximum polarity scores, from each review. These polarity scores of words are computed as in (DISPLAY_FORM4). For example, if a review consists of five words, it would have five polarity scores and we utilise only three of these sentiment scores as mentioned. Lastly, we concatenate these three scores to the averaged word vector per review.\nThat is, each review is represented by the average word vector of its constituent word embeddings and three supervised scores. We then feed these inputs into the SVM approach. The flowchart of our framework is given in Figure FIGREF11. When combining the unsupervised features, which are word vectors created on a word basis, with supervised three scores extracted on a review basis, we have better state-of-the-art results.\nDatasets\nWe utilised two datasets for both Turkish and English to evaluate our methods.\nFor Turkish, as the first dataset, we utilised the movie reviews which are collected from a popular website. The number of reviews in this movie corpus is 20,244 and the average number of words in reviews is 39. Each of these reviews has a star-rating score which is indicative of sentiment. These polarity scores are between the values 0.5 and 5, at intervals of 0.5. We consider a review to be negative it the score is equal to or lower than 2.5. On the other hand, if it is equal to or higher than 4, it is assumed to be positive. We have randomly selected 7,020 negative and 7,020 positive reviews and processed only them.\nThe second Turkish dataset is the Twitter corpus which is formed of tweets about Turkish mobile network operators. Those tweets are mostly much noisier and shorter compared to the reviews in the movie corpus. In total, there are 1,716 tweets. 973 of them are negative and 743 of them are positive. These tweets are manually annotated by two humans, where the labels are either positive or negative. We measured the Cohen's Kappa inter-annotator agreement score to be 0.82. If there was a disagreement on the polarity of a tweet, we removed it.\nWe also utilised two other datasets in English to test the cross-linguality of our approaches. One of them is a movie corpus collected from the web. There are 5,331 positive reviews and 5,331 negative reviews in this corpus. The other is a Twitter dataset, which has nearly 1.6 million tweets annotated through a distant supervised method BIBREF14. These tweets have positive, neutral, and negative labels. We have selected 7,020 positive tweets and 7,020 negative tweets randomly to generate a balanced dataset.\nExperiments ::: Preprocessing\nIn Turkish, people sometimes prefer to spell English characters for the corresponding Turkish characters (e.g. i for \u0131, c for \u00e7) when writing in electronic format. To normalise such words, we used the Zemberek tool BIBREF15. All punctuation marks except \u201c!\" and \u201c?\" are removed, since they do not contribute much to the polarity of a document. We took into account emoticons, such as \u201c:))\", and idioms, such as \u201ckafay\u0131 yemek\u201d (lose one's mind), since two or more words can express a sentiment together, irrespective of the individual words thereof. Since Turkish is an agglutinative language, we used the morphological parser and disambiguator tools BIBREF16, BIBREF17. We also performed negation handling and stop-word elimination. In negation handling, we append an underscore to the end of a word if it is negated. For example, \u201cg\u00fczel de\u011fil\" (not beautiful) is redefined as \u201cg\u00fczel_\" (beautiful_) in the feature selection stage when supervised scores are being computed.\nExperiments ::: Hyperparameters\nWe used the LibSVM utility of the WEKA tool. We chose the linear kernel option to classify the reviews. We trained word2vec embeddings on all the four corpora using the Gensim library BIBREF18 with the skip-gram method. The dimension size of these embeddings are set at 200. As mentioned, other embeddings, which are generated utilising the clustering and the SVD approach, are also of size 200. For c-mean clustering, we set the maximum number of iterations at 25, unless it converges.\nExperiments ::: Results\nWe evaluated our models on four corpora, which are the movie and the Twitter datasets in Turkish and English. All of the embeddings are learnt on four corpora separately. We have used the accuracy metric since all the datasets are completely or nearly completely balanced. We performed 10-fold cross-validation for both of the datasets. We used the approximate randomisation technique to test whether our results are statistically significant. Here, we tried to predict the labels of reviews and assess the performance.\nWe obtained varying accuracies as shown in Table TABREF17. \u201c3 feats\" features are those hand-crafted features we extracted, which are the minimum, mean, and maximum polarity scores of the reviews as explained in Section 3.5. As can be seen, at least one of our methods outperforms the baseline word2vec approach for all the Turkish and English corpora, and all categories. All of our approaches performed better when we used the supervised scores, which are extracted on a review basis, and concatenated them to word vectors. Mostly, the supervised 4-scores feature leads to the highest accuracies, since it employs the annotational information concerned with polarities on a word basis.\nAs can be seen in Table TABREF17, the clustering method, in general, yields the lowest scores. We found out that the corpus - SVD metric does always perform better than the clustering method. We attribute it to that in SVD the most important singular values are taken into account. The corpus - SVD technique outperforms the word2vec algorithm for some corpora. When we do not take into account the 3-feats technique, the corpus-based SVD method yields the highest accuracies for the English Twitter dataset. We show that simple models can outperform more complex models, such as the concatenation of the three subapproaches or the word2vec algorithm. Another interesting finding is that for some cases the accuracy decreases when we utilise the polarity labels, as in the case for the English Twitter dataset.\nSince the TDK dictionary covers most of the domain-specific vocabulary used in the movie reviews, the dictionary method performs well. However, the dictionary lacks many of the words, occurring in the tweets; therefore, its performance is not the best of all. When the TDK method is combined with the 3-feats technique, we observed a great improvement, as can be expected.\nSuccess rates obtained for the movie corpus are much better than those for the Twitter dataset for most of our approaches, since tweets are, in general, much shorter and noisier. We also found out that, when choosing the p value as 0.05, our results are statistically significant compared to the baseline approach in Turkish BIBREF10. Some of our subapproaches also produce better success rates than those sentiment analysis models employed in English BIBREF11, BIBREF12. We have achieved state-of-the-art results for the sentiment classification task for both Turkish and English. As mentioned, our approaches, in general, perform best in predicting the labels of reviews when three supervised scores are additionality utilised.\nWe also employed the convolutional neural network model (CNN). However, the SVM classifier, which is a conventional machine learning algorithm, performed better. We did not include the performances of CNN for embedding types here due to the page limit of the paper.\nAs a qualitative assessment of the word representations, given some query words we visualised the most similar words to those words using the cosine similarity metric. By assessing the similarities between a word and all the other corpus words, we can find the most akin words according to different approaches. Table TABREF18 shows the most similar words to given query words. Those words which are indicative of sentiment are, in general, found to be most similar to those words of the same polarity. For example, the most akin word to muhte\u015fem (gorgeous) is 10/10, both of which have positive polarity. As can be seen in Table TABREF18, our corpus-based approach is more adept at capturing domain-specific features as compared to word2vec, which generally captures general semantic and syntactic characteristics, but not the sentimental ones.\nConclusion\nWe have demonstrated that using word vectors that capture only semantic and syntactic characteristics may be improved by taking into account their sentimental aspects as well. Our approaches are cross-lingual and cross-domain. They can be applied to other domains and other languages than Turkish and English with minor changes.\nOur study is one of the few ones that perform sentiment analysis in Turkish and leverages sentimental characteristics of words in generating word vectors and outperforms all the others. Any of the approaches we propose can be used independently of the others. Our approaches without using sentiment labels can be applied to other classification tasks, such as topic classification and concept mining.\nThe experiments show that even unsupervised approaches, as in the corpus-based approach, can outperform supervised approaches in classification tasks. Combining some approaches, which can compensate for what others lack, can help us build better vectors. Our word vectors are created by conventional machine learning algorithms; however, they, as in the corpus-based model, produce state-of-the-art results. Although we preferred to use a classical machine learning algorithm, which is SVM, over a neural network classifier to predict the labels of reviews, we achieved accuracies of over 90 per cent for the Turkish movie corpus and about 88 per cent for the English Twitter dataset.\nWe performed only binary sentiment classification in this study as most of the studies in the literature do. We will extend our system in future by using neutral reviews as well. We also plan to employ Turkish WordNet to enhance the generalisability of our embeddings as another future work.\nAcknowledgments\nThis work was supported by Bo\u011fazi\u00e7i University Research Fund Grant Number 6980D, and by Turkish Ministry of Development under the TAM Project number DPT2007K12-0610. Cem R\u0131fk\u0131 Ayd\u0131n is supported by T\u00dcB\u0130TAK BIDEB 2211E.", "source": "scientific_qa", "evaluation": "f1"}
{"instructions": ["What topic is covered in the Chinese Facebook data? ", "How many layers does the UTCNN model have?", "What topics are included in the debate data?", "What topics are included in the debate data?", "What is the size of the Chinese data?", "Did they collected the two datasets?", "What are the baselines?"], "outputs": ["anti-nuclear-power", "eight layers", "abortion, gay rights, Obama, marijuana", "abortion (ABO), gay rights (GAY), Obama (OBA), and marijuana (MAR)", "32,595", "No", "SVM with unigram, bigram, trigram features, with average word embedding, with average transformed word embeddings, CNN and RCNN, SVM, CNN, RCNN with comment information"], "input": "Introduction\nThis work is licenced under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/ Deep neural networks have been widely used in text classification and have achieved promising results BIBREF0 , BIBREF1 , BIBREF2 . Most focus on content information and use models such as convolutional neural networks (CNN) BIBREF3 or recursive neural networks BIBREF4 . However, for user-generated posts on social media like Facebook or Twitter, there is more information that should not be ignored. On social media platforms, a user can act either as the author of a post or as a reader who expresses his or her comments about the post.\nIn this paper, we classify posts taking into account post authorship, likes, topics, and comments. In particular, users and their \u201clikes\u201d hold strong potential for text mining. For example, given a set of posts that are related to a specific topic, a user's likes and dislikes provide clues for stance labeling. From a user point of view, users with positive attitudes toward the issue leave positive comments on the posts with praise or even just the post's content; from a post point of view, positive posts attract users who hold positive stances. We also investigate the influence of topics: different topics are associated with different stance labeling tendencies and word usage. For example we discuss women's rights and unwanted babies on the topic of abortion, but we criticize medicine usage or crime when on the topic of marijuana BIBREF5 . Even for posts on a specific topic like nuclear power, a variety of arguments are raised: green energy, radiation, air pollution, and so on. As for comments, we treat them as additional text information. The arguments in the comments and the commenters (the users who leave the comments) provide hints on the post's content and further facilitate stance classification.\nIn this paper, we propose the user-topic-comment neural network (UTCNN), a deep learning model that utilizes user, topic, and comment information. We attempt to learn user and topic representations which encode user interactions and topic influences to further enhance text classification, and we also incorporate comment information. We evaluate this model on a post stance classification task on forum-style social media platforms. The contributions of this paper are as follows: 1. We propose UTCNN, a neural network for text in modern social media channels as well as legacy social media, forums, and message boards \u2014 anywhere that reveals users, their tastes, as well as their replies to posts. 2. When classifying social media post stances, we leverage users, including authors and likers. User embeddings can be generated even for users who have never posted anything. 3. We incorporate a topic model to automatically assign topics to each post in a single topic dataset. 4. We show that overall, the proposed method achieves the highest performance in all instances, and that all of the information extracted, whether users, topics, or comments, still has its contributions.\nExtra-Linguistic Features for Stance Classification\nIn this paper we aim to use text as well as other features to see how they complement each other in a deep learning model. In the stance classification domain, previous work has showed that text features are limited, suggesting that adding extra-linguistic constraints could improve performance BIBREF6 , BIBREF7 , BIBREF8 . For example, Hasan and Ng as well as Thomas et al. require that posts written by the same author have the same stance BIBREF9 , BIBREF10 . The addition of this constraint yields accuracy improvements of 1\u20137% for some models and datasets. Hasan and Ng later added user-interaction constraints and ideology constraints BIBREF7 : the former models the relationship among posts in a sequence of replies and the latter models inter-topic relationships, e.g., users who oppose abortion could be conservative and thus are likely to oppose gay rights.\nFor work focusing on online forum text, since posts are linked through user replies, sequential labeling methods have been used to model relationships between posts. For example, Hasan and Ng use hidden Markov models (HMMs) to model dependent relationships to the preceding post BIBREF9 ; Burfoot et al. use iterative classification to repeatedly generate new estimates based on the current state of knowledge BIBREF11 ; Sridhar et al. use probabilistic soft logic (PSL) to model reply links via collaborative filtering BIBREF12 . In the Facebook dataset we study, we use comments instead of reply links. However, as the ultimate goal in this paper is predicting not comment stance but post stance, we treat comments as extra information for use in predicting post stance.\nDeep Learning on Extra-Linguistic Features\nIn recent years neural network models have been applied to document sentiment classification BIBREF13 , BIBREF4 , BIBREF14 , BIBREF15 , BIBREF2 . Text features can be used in deep networks to capture text semantics or sentiment. For example, Dong et al. use an adaptive layer in a recursive neural network for target-dependent Twitter sentiment analysis, where targets are topics such as windows 7 or taylor swift BIBREF16 , BIBREF17 ; recursive neural tensor networks (RNTNs) utilize sentence parse trees to capture sentence-level sentiment for movie reviews BIBREF4 ; Le and Mikolov predict sentiment by using paragraph vectors to model each paragraph as a continuous representation BIBREF18 . They show that performance can thus be improved by more delicate text models.\nOthers have suggested using extra-linguistic features to improve the deep learning model. The user-word composition vector model (UWCVM) BIBREF19 is inspired by the possibility that the strength of sentiment words is user-specific; to capture this they add user embeddings in their model. In UPNN, a later extension, they further add a product-word composition as product embeddings, arguing that products can also show different tendencies of being rated or reviewed BIBREF20 . Their addition of user information yielded 2\u201310% improvements in accuracy as compared to the above-mentioned RNTN and paragraph vector methods. We also seek to inject user information into the neural network model. In comparison to the research of Tang et al. on sentiment classification for product reviews, the difference is two-fold. First, we take into account multiple users (one author and potentially many likers) for one post, whereas only one user (the reviewer) is involved in a review. Second, we add comment information to provide more features for post stance classification. None of these two factors have been considered previously in a deep learning model for text stance classification. Therefore, we propose UTCNN, which generates and utilizes user embeddings for all users \u2014 even for those who have not authored any posts \u2014 and incorporates comments to further improve performance.\nMethod\nIn this section, we first describe CNN-based document composition, which captures user- and topic-dependent document-level semantic representation from word representations. Then we show how to add comment information to construct the user-topic-comment neural network (UTCNN).\nUser- and Topic-dependent Document Composition\nAs shown in Figure FIGREF4 , we use a general CNN BIBREF3 and two semantic transformations for document composition . We are given a document with an engaged user INLINEFORM0 , a topic INLINEFORM1 , and its composite INLINEFORM2 words, each word INLINEFORM3 of which is associated with a word embedding INLINEFORM4 where INLINEFORM5 is the vector dimension. For each word embedding INLINEFORM6 , we apply two dot operations as shown in Equation EQREF6 : DISPLAYFORM0\nwhere INLINEFORM0 models the user reading preference for certain semantics, and INLINEFORM1 models the topic semantics; INLINEFORM2 and INLINEFORM3 are the dimensions of transformed user and topic embeddings respectively. We use INLINEFORM4 to model semantically what each user prefers to read and/or write, and use INLINEFORM5 to model the semantics of each topic. The dot operation of INLINEFORM6 and INLINEFORM7 transforms the global representation INLINEFORM8 to a user-dependent representation. Likewise, the dot operation of INLINEFORM9 and INLINEFORM10 transforms INLINEFORM11 to a topic-dependent representation.\nAfter the two dot operations on INLINEFORM0 , we have user-dependent and topic-dependent word vectors INLINEFORM1 and INLINEFORM2 , which are concatenated to form a user- and topic-dependent word vector INLINEFORM3 . Then the transformed word embeddings INLINEFORM4 are used as the CNN input. Here we apply three convolutional layers on the concatenated transformed word embeddings INLINEFORM5 : DISPLAYFORM0\nwhere INLINEFORM0 is the index of words; INLINEFORM1 is a non-linear activation function (we use INLINEFORM2 ); INLINEFORM5 is the convolutional filter with input length INLINEFORM6 and output length INLINEFORM7 , where INLINEFORM8 is the window size of the convolutional operation; and INLINEFORM9 and INLINEFORM10 are the output and bias of the convolution layer INLINEFORM11 , respectively. In our experiments, the three window sizes INLINEFORM12 in the three convolution layers are one, two, and three, encoding unigram, bigram, and trigram semantics accordingly.\nAfter the convolutional layer, we add a maximum pooling layer among convolutional outputs to obtain the unigram, bigram, and trigram n-gram representations. This is succeeded by an average pooling layer for an element-wise average of the three maximized convolution outputs.\nUTCNN Model Description\nFigure FIGREF10 illustrates the UTCNN model. As more than one user may interact with a given post, we first add a maximum pooling layer after the user matrix embedding layer and user vector embedding layer to form a moderator matrix embedding INLINEFORM0 and a moderator vector embedding INLINEFORM1 for moderator INLINEFORM2 respectively, where INLINEFORM3 is used for the semantic transformation in the document composition process, as mentioned in the previous section. The term moderator here is to denote the pseudo user who provides the overall semantic/sentiment of all the engaged users for one document. The embedding INLINEFORM4 models the moderator stance preference, that is, the pattern of the revealed user stance: whether a user is willing to show his preference, whether a user likes to show impartiality with neutral statements and reasonable arguments, or just wants to show strong support for one stance. Ideally, the latent user stance is modeled by INLINEFORM5 for each user. Likewise, for topic information, a maximum pooling layer is added after the topic matrix embedding layer and topic vector embedding layer to form a joint topic matrix embedding INLINEFORM6 and a joint topic vector embedding INLINEFORM7 for topic INLINEFORM8 respectively, where INLINEFORM9 models the semantic transformation of topic INLINEFORM10 as in users and INLINEFORM11 models the topic stance tendency. The latent topic stance is also modeled by INLINEFORM12 for each topic.\nAs for comments, we view them as short documents with authors only but without likers nor their own comments. Therefore we apply document composition on comments although here users are commenters (users who comment). It is noticed that the word embeddings INLINEFORM0 for the same word in the posts and comments are the same, but after being transformed to INLINEFORM1 in the document composition process shown in Figure FIGREF4 , they might become different because of their different engaged users. The output comment representation together with the commenter vector embedding INLINEFORM2 and topic vector embedding INLINEFORM3 are concatenated and a maximum pooling layer is added to select the most important feature for comments. Instead of requiring that the comment stance agree with the post, UTCNN simply extracts the most important features of the comment contents; they could be helpful, whether they show obvious agreement or disagreement. Therefore when combining comment information here, the maximum pooling layer is more appropriate than other pooling or merging layers. Indeed, we believe this is one reason for UTCNN's performance gains.\nFinally, the pooled comment representation, together with user vector embedding INLINEFORM0 , topic vector embedding INLINEFORM1 , and document representation are fed to a fully connected network, and softmax is applied to yield the final stance label prediction for the post.\nExperiment\nWe start with the experimental dataset and then describe the training process as well as the implementation of the baselines. We also implement several variations to reveal the effects of features: authors, likers, comment, and commenters. In the results section we compare our model with related work.\nDataset\nWe tested the proposed UTCNN on two different datasets: FBFans and CreateDebate. FBFans is a privately-owned, single-topic, Chinese, unbalanced, social media dataset, and CreateDebate is a public, multiple-topic, English, balanced, forum dataset. Results using these two datasets show the applicability and superiority for different topics, languages, data distributions, and platforms.\nThe FBFans dataset contains data from anti-nuclear-power Chinese Facebook fan groups from September 2013 to August 2014, including posts and their author and liker IDs. There are a total of 2,496 authors, 505,137 likers, 33,686 commenters, and 505,412 unique users. Two annotators were asked to take into account only the post content to label the stance of the posts in the whole dataset as supportive, neutral, or unsupportive (hereafter denoted as Sup, Neu, and Uns). Sup/Uns posts were those in support of or against anti-reconstruction; Neu posts were those evincing a neutral standpoint on the topic, or were irrelevant. Raw agreement between annotators is 0.91, indicating high agreement. Specifically, Cohen\u2019s Kappa for Neu and not Neu labeling is 0.58 (moderate), and for Sup or Uns labeling is 0.84 (almost perfect). Posts with inconsistent labels were filtered out, and the development and testing sets were randomly selected from what was left. Posts in the development and testing sets involved at least one user who appeared in the training set. The number of posts for each stance is shown on the left-hand side of Table TABREF12 . About twenty percent of the posts were labeled with a stance, and the number of supportive (Sup) posts was much larger than that of the unsupportive (Uns) ones: this is thus highly skewed data, which complicates stance classification. On average, 161.1 users were involved in one post. The maximum was 23,297 and the minimum was one (the author). For comments, on average there were 3 comments per post. The maximum was 1,092 and the minimum was zero.\nTo test whether the assumption of this paper \u2013 posts attract users who hold the same stance to like them \u2013 is reliable, we examine the likes from authors of different stances. Posts in FBFans dataset are used for this analysis. We calculate the like statistics of each distinct author from these 32,595 posts. As the numbers of authors in the Sup, Neu and Uns stances are largely imbalanced, these numbers are normalized by the number of users of each stance. Table TABREF13 shows the results. Posts with stances (i.e., not neutral) attract users of the same stance. Neutral posts also attract both supportive and neutral users, like what we observe in supportive posts, but just the neutral posts can attract even more neutral likers. These results do suggest that users prefer posts of the same stance, or at least posts of no obvious stance which might cause annoyance when reading, and hence support the user modeling in our approach.\nThe CreateDebate dataset was collected from an English online debate forum discussing four topics: abortion (ABO), gay rights (GAY), Obama (OBA), and marijuana (MAR). The posts are annotated as for (F) and against (A). Replies to posts in this dataset are also labeled with stance and hence use the same data format as posts. The labeling results are shown in the right-hand side of Table TABREF12 . We observe that the dataset is more balanced than the FBFans dataset. In addition, there are 977 unique users in the dataset. To compare with Hasan and Ng's work, we conducted five-fold cross-validation and present the annotation results as the average number of all folds BIBREF9 , BIBREF5 .\nThe FBFans dataset has more integrated functions than the CreateDebate dataset; thus our model can utilize all linguistic and extra-linguistic features. For the CreateDebate dataset, on the other hand, the like and comment features are not available (as there is a stance label for each reply, replies are evaluated as posts as other previous work) but we still implemented our model using the content, author, and topic information.\nSettings\nIn the UTCNN training process, cross-entropy was used as the loss function and AdaGrad as the optimizer. For FBFans dataset, we learned the 50-dimensional word embeddings on the whole dataset using GloVe BIBREF21 to capture the word semantics; for CreateDebate dataset we used the publicly available English 50-dimensional word embeddings, pre-trained also using GloVe. These word embeddings were fixed in the training process. The learning rate was set to 0.03. All user and topic embeddings were randomly initialized in the range of [-0.1 0.1]. Matrix embeddings for users and topics were sized at 250 ( INLINEFORM0 ); vector embeddings for users and topics were set to length 10.\nWe applied the LDA topic model BIBREF22 on the FBFans dataset to determine the latent topics with which to build topic embeddings, as there is only one general known topic: nuclear power plants. We learned 100 latent topics and assigned the top three topics for each post. For the CreateDebate dataset, which itself constitutes four topics, the topic labels for posts were used directly without additionally applying LDA.\nFor the FBFans data we report class-based f-scores as well as the macro-average f-score ( INLINEFORM0 ) shown in equation EQREF19 . DISPLAYFORM0\nwhere INLINEFORM0 and INLINEFORM1 are the average precision and recall of the three class. We adopted the macro-average f-score as the evaluation metric for the overall performance because (1) the experimental dataset is severely imbalanced, which is common for contentious issues; and (2) for stance classification, content in minor-class posts is usually more important for further applications. For the CreateDebate dataset, accuracy was adopted as the evaluation metric to compare the results with related work BIBREF7 , BIBREF9 , BIBREF12 .\nBaselines\nWe pit our model against the following baselines: 1) SVM with unigram, bigram, and trigram features, which is a standard yet rather strong classifier for text features; 2) SVM with average word embedding, where a document is represented as a continuous representation by averaging the embeddings of the composite words; 3) SVM with average transformed word embeddings (the INLINEFORM0 in equation EQREF6 ), where a document is represented as a continuous representation by averaging the transformed embeddings of the composite words; 4) two mature deep learning models on text classification, CNN BIBREF3 and Recurrent Convolutional Neural Networks (RCNN) BIBREF0 , where the hyperparameters are based on their work; 5) the above SVM and deep learning models with comment information; 6) UTCNN without user information, representing a pure-text CNN model where we use the same user matrix and user embeddings INLINEFORM1 and INLINEFORM2 for each user; 7) UTCNN without the LDA model, representing how UTCNN works with a single-topic dataset; 8) UTCNN without comments, in which the model predicts the stance label given only user and topic information. All these models were trained on the training set, and parameters as well as the SVM kernel selections (linear or RBF) were fine-tuned on the development set. Also, we adopt oversampling on SVMs, CNN and RCNN because the FBFans dataset is highly imbalanced.\nResults on FBFans Dataset\nIn Table TABREF22 we show the results of UTCNN and the baselines on the FBFans dataset. Here Majority yields good performance on Neu since FBFans is highly biased to the neutral class. The SVM models perform well on Sup and Neu but perform poorly for Uns, showing that content information in itself is insufficient to predict stance labels, especially for the minor class. With the transformed word embedding feature, SVM can achieve comparable performance as SVM with n-gram feature. However, the much fewer feature dimension of the transformed word embedding makes SVM with word embeddings a more efficient choice for modeling the large scale social media dataset. For the CNN and RCNN models, they perform slightly better than most of the SVM models but still, the content information is insufficient to achieve a good performance on the Uns posts. As to adding comment information to these models, since the commenters do not always hold the same stance as the author, simply adding comments and post contents together merely adds noise to the model.\nAmong all UTCNN variations, we find that user information is most important, followed by topic and comment information. UTCNN without user information shows results similar to SVMs \u2014 it does well for Sup and Neu but detects no Uns. Its best f-scores on both Sup and Neu among all methods show that with enough training data, content-based models can perform well; at the same time, the lack of user information results in too few clues for minor-class posts to either predict their stance directly or link them to other users and posts for improved performance. The 17.5% improvement when adding user information suggests that user information is especially useful when the dataset is highly imbalanced. All models that consider user information predict the minority class successfully. UCTNN without topic information works well but achieves lower performance than the full UTCNN model. The 4.9% performance gain brought by LDA shows that although it is satisfactory for single topic datasets, adding that latent topics still benefits performance: even when we are discussing the same topic, we use different arguments and supporting evidence. Lastly, we get 4.8% improvement when adding comment information and it achieves comparable performance to UTCNN without topic information, which shows that comments also benefit performance. For platforms where user IDs are pixelated or otherwise hidden, adding comments to a text model still improves performance. In its integration of user, content, and comment information, the full UTCNN produces the highest f-scores on all Sup, Neu, and Uns stances among models that predict the Uns class, and the highest macro-average f-score overall. This shows its ability to balance a biased dataset and supports our claim that UTCNN successfully bridges content and user, topic, and comment information for stance classification on social media text. Another merit of UTCNN is that it does not require a balanced training data. This is supported by its outperforming other models though no oversampling technique is applied to the UTCNN related experiments as shown in this paper. Thus we can conclude that the user information provides strong clues and it is still rich even in the minority class.\nWe also investigate the semantic difference when a user acts as an author/liker or a commenter. We evaluated a variation in which all embeddings from the same user were forced to be identical (this is the UTCNN shared user embedding setting in Table TABREF22 ). This setting yielded only a 2.5% improvement over the model without comments, which is not statistically significant. However, when separating authors/likers and commenters embeddings (i.e., the UTCNN full model), we achieved much greater improvements (4.8%). We attribute this result to the tendency of users to use different wording for different roles (for instance author vs commenter). This is observed when the user, acting as an author, attempts to support her argument against nuclear power by using improvements in solar power; when acting as a commenter, though, she interacts with post contents by criticizing past politicians who supported nuclear power or by arguing that the proposed evacuation plan in case of a nuclear accident is ridiculous. Based on this finding, in the final UTCNN setting we train two user matrix embeddings for one user: one for the author/liker role and the other for the commenter role.\nResults on CreateDebate Dataset\nTable TABREF24 shows the results of UTCNN, baselines as we implemented on the FBFans datset and related work on the CreateDebate dataset. We do not adopt oversampling on these models because the CreateDebate dataset is almost balanced. In previous work, integer linear programming (ILP) or linear-chain conditional random fields (CRFs) were proposed to integrate text features, author, ideology, and user-interaction constraints, where text features are unigram, bigram, and POS-dependencies; the author constraint tends to require that posts from the same author for the same topic hold the same stance; the ideology constraint aims to capture inferences between topics for the same author; the user-interaction constraint models relationships among posts via user interactions such as replies BIBREF7 , BIBREF9 .\nThe SVM with n-gram or average word embedding feature performs just similar to the majority. However, with the transformed word embedding, it achieves superior results. It shows that the learned user and topic embeddings really capture the user and topic semantics. This finding is not so obvious in the FBFans dataset and it might be due to the unfavorable data skewness for SVM. As for CNN and RCNN, they perform slightly better than most SVMs as we found in Table TABREF22 for FBFans.\nCompared to the ILP BIBREF7 and CRF BIBREF9 methods, the UTCNN user embeddings encode author and user-interaction constraints, where the ideology constraint is modeled by the topic embeddings and text features are modeled by the CNN. The significant improvement achieved by UTCNN suggests the latent representations are more effective than overt model constraints.\nThe PSL model BIBREF12 jointly labels both author and post stance using probabilistic soft logic (PSL) BIBREF23 by considering text features and reply links between authors and posts as in Hasan and Ng's work. Table TABREF24 reports the result of their best AD setting, which represents the full joint stance/disagreement collective model on posts and is hence more relevant to UTCNN. In contrast to their model, the UTCNN user embeddings represent relationships between authors, but UTCNN models do not utilize link information between posts. Though the PSL model has the advantage of being able to jointly label the stances of authors and posts, its performance on posts is lower than the that for the ILP or CRF models. UTCNN significantly outperforms these models on posts and has the potential to predict user stances through the generated user embeddings.\nFor the CreateDebate dataset, we also evaluated performance when not using topic embeddings or user embeddings; as replies in this dataset are viewed as posts, the setting without comment embeddings is not available. Table TABREF24 shows the same findings as Table TABREF22 : the 21% improvement in accuracy demonstrates that user information is the most vital. This finding also supports the results in the related work: user constraints are useful and can yield 11.2% improvement in accuracy BIBREF7 . Further considering topic information yields 3.4% improvement, suggesting that knowing the subject of debates provides useful information. In sum, Table TABREF22 together with Table TABREF24 show that UTCNN achieves promising performance regardless of topic, language, data distribution, and platform.\nConclusion\nWe have proposed UTCNN, a neural network model that incorporates user, topic, content and comment information for stance classification on social media texts. UTCNN learns user embeddings for all users with minimum active degree, i.e., one post or one like. Topic information obtained from the topic model or the pre-defined labels further improves the UTCNN model. In addition, comment information provides additional clues for stance classification. We have shown that UTCNN achieves promising and balanced results. In the future we plan to explore the effectiveness of the UTCNN user embeddings for author stance classification.\nAcknowledgements\nResearch of this paper was partially supported by Ministry of Science and Technology, Taiwan, under the contract MOST 104-2221-E-001-024-MY2.", "source": "scientific_qa", "evaluation": "f1"}
{"instructions": ["why are their techniques cheaper to implement?", "what data simulation techniques were introduced?", "what is their explanation for the effectiveness of back-translation?", "what dataset is used?", "what language pairs are explored?", "what language is the data in?"], "outputs": ["They use a slightly modified copy of the target to create the pseudo-text instead of full BT to make their technique cheaper", "copy, copy-marked, copy-dummies", "when using BT, cases where the source is shorter than the target are rarer; cases when they have the same length are more frequent, automatic word alignments between artificial sources tend to be more monotonic than when using natural sources", "Europarl corpus , WMT newstest 2014, News-Commentary-11, Wikipedia from WMT 2014, Multi-UN, EU-Bookshop, Rapid, Common-Crawl (WMT 2017)", "English-German, English-French", "English , German, French"], "input": "Introduction \nThe new generation of Neural Machine Translation (NMT) systems is known to be extremely data hungry BIBREF0 . Yet, most existing NMT training pipelines fail to fully take advantage of the very large volume of monolingual source and/or parallel data that is often available. Making a better use of data is particularly critical in domain adaptation scenarios, where parallel adaptation data is usually assumed to be small in comparison to out-of-domain parallel data, or to in-domain monolingual texts. This situation sharply contrasts with the previous generation of statistical MT engines BIBREF1 , which could seamlessly integrate very large amounts of non-parallel documents, usually with a large positive effect on translation quality.\nSuch observations have been made repeatedly and have led to many innovative techniques to integrate monolingual data in NMT, that we review shortly. The most successful approach to date is the proposal of BIBREF2 , who use monolingual target texts to generate artificial parallel data via backward translation (BT). This technique has since proven effective in many subsequent studies. It is however very computationally costly, typically requiring to translate large sets of data. Determining the \u201cright\u201d amount (and quality) of BT data is another open issue, but we observe that experiments reported in the literature only use a subset of the available monolingual resources. This suggests that standard recipes for BT might be sub-optimal.\nThis paper aims to better understand the strengths and weaknesses of BT and to design more principled techniques to improve its effects. More specifically, we seek to answer the following questions: since there are many ways to generate pseudo parallel corpora, how important is the quality of this data for MT performance? Which properties of back-translated sentences actually matter for MT quality? Does BT act as some kind of regularizer BIBREF3 ? Can BT be efficiently simulated? Does BT data play the same role as a target-side language modeling, or are they complementary? BT is often used for domain adaptation: can the effect of having more in-domain data be sorted out from the mere increase of training material BIBREF2 ? For studies related to the impact of varying the size of BT data, we refer the readers to the recent work of BIBREF4 .\nTo answer these questions, we have reimplemented several strategies to use monolingual data in NMT and have run experiments on two language pairs in a very controlled setting (see \u00a7 SECREF2 ). Our main results (see \u00a7 SECREF4 and \u00a7 SECREF5 ) suggest promising directions for efficient domain adaptation with cheaper techniques than conventional BT.\nIn-domain and out-of-domain data \nWe are mostly interested with the following training scenario: a large out-of-domain parallel corpus, and limited monolingual in-domain data. We focus here on the Europarl domain, for which we have ample data in several languages, and use as in-domain training data the Europarl corpus BIBREF5 for two translation directions: English INLINEFORM0 German and English INLINEFORM1 French. As we study the benefits of monolingual data, most of our experiments only use the target side of this corpus. The rationale for choosing this domain is to (i) to perform large scale comparisons of synthetic and natural parallel corpora; (ii) to study the effect of BT in a well-defined domain-adaptation scenario. For both language pairs, we use the Europarl tests from 2007 and 2008 for evaluation purposes, keeping test 2006 for development. When measuring out-of-domain performance, we will use the WMT newstest 2014.\nNMT setups and performance \nOur baseline NMT system implements the attentional encoder-decoder approach BIBREF6 , BIBREF7 as implemented in Nematus BIBREF8 on 4 million out-of-domain parallel sentences. For French we use samples from News-Commentary-11 and Wikipedia from WMT 2014 shared translation task, as well as the Multi-UN BIBREF9 and EU-Bookshop BIBREF10 corpora. For German, we use samples from News-Commentary-11, Rapid, Common-Crawl (WMT 2017) and Multi-UN (see table TABREF5 ). Bilingual BPE units BIBREF11 are learned with 50k merge operations, yielding vocabularies of about respectively 32k and 36k for English INLINEFORM0 French and 32k and 44k for English INLINEFORM1 German.\nBoth systems use 512-dimensional word embeddings and a single hidden layer with 1024 cells. They are optimized using Adam BIBREF12 and early stopped according to the validation performance. Training lasted for about three weeks on an Nvidia K80 GPU card.\nSystems generating back-translated data are trained using the same out-of-domain corpus, where we simply exchange the source and target sides. They are further documented in \u00a7 SECREF8 . For the sake of comparison, we also train a system that has access to a large batch of in-domain parallel data following the strategy often referred to as \u201cfine-tuning\u201d: upon convergence of the baseline model, we resume training with a 2M sentence in-domain corpus mixed with an equal amount of randomly selected out-of-domain natural sentences, with the same architecture and training parameters, running validation every 2000 updates with a patience of 10. Since BPE units are selected based only on the out-of-domain statistics, fine-tuning is performed on sentences that are slightly longer (ie. they contain more units) than for the initial training. This system defines an upper-bound of the translation performance and is denoted below as natural.\nOur baseline and topline results are in Table TABREF6 , where we measure translation performance using BLEU BIBREF13 , BEER BIBREF14 (higher is better) and characTER BIBREF15 (smaller is better). As they are trained from much smaller amounts of data than current systems, these baselines are not quite competitive to today's best system, but still represent serious baselines for these datasets. Given our setups, fine-tuning with in-domain natural data improves BLEU by almost 4 points for both translation directions on in-domain tests; it also improves, albeit by a smaller margin, the BLEU score of the out-of-domain tests.\nUsing artificial parallel data in NMT \nA simple way to use monolingual data in MT is to turn it into synthetic parallel data and let the training procedure run as usual BIBREF16 . In this section, we explore various ways to implement this strategy. We first reproduce results of BIBREF2 with BT of various qualities, that we then analyze thoroughly.\nThe quality of Back-Translation \nBT requires the availability of an MT system in the reverse translation direction. We consider here three MT systems of increasing quality:\nbacktrans-bad: this is a very poor SMT system trained using only 50k parallel sentences from the out-of-domain data, and no additional monolingual data. For this system as for the next one, we use Moses BIBREF17 out-of-the-box, computing alignments with Fastalign BIBREF18 , with a minimal pre-processing (basic tokenization). This setting provides us with a pessimistic estimate of what we could get in low-resource conditions.\nbacktrans-good: these are much larger SMT systems, which use the same parallel data as the baseline NMTs (see \u00a7 SECREF4 ) and all the English monolingual data available for the WMT 2017 shared tasks, totalling approximately 174M sentences. These systems are strong, yet relatively cheap to build.\nbacktrans-nmt: these are the best NMT systems we could train, using settings that replicate the forward translation NMTs.\nNote that we do not use any in-domain (Europarl) data to train these systems. Their performance is reported in Table TABREF7 , where we observe a 12 BLEU points gap between the worst and best systems (for both languages).\nAs noted eg. in BIBREF19 , BIBREF20 , artificial parallel data obtained through forward-translation (FT) can also prove advantageous and we also consider a FT system (fwdtrans-nmt): in this case the target side of the corpus is artificial and is generated using the baseline NMT applied to a natural source.\nOur results (see Table TABREF6 ) replicate the findings of BIBREF2 : large gains can be obtained from BT (nearly INLINEFORM0 BLEU in French and German); better artificial data yields better translation systems. Interestingly, our best Moses system is almost as good as the NMT and an order of magnitude faster to train. Improvements obtained with the bad system are much smaller; contrary to the better MTs, this system is even detrimental for the out-of-domain test.\nGains with forward translation are significant, as in BIBREF21 , albeit about half as good as with BT, and result in small improvements for the in-domain and for the out-of-domain tests. Experiments combining forward and backward translation (backfwdtrans-nmt), each using a half of the available artificial data, do not outperform the best BT results.\nWe finally note the large remaining difference between BT data and natural data, even though they only differ in their source side. This shows that at least in our domain-adaptation settings, BT does not really act as a regularizer, contrarily to the findings of BIBREF4 , BIBREF11 . Figure FIGREF13 displays the learning curves of these two systems. We observe that backtrans-nmt improves quickly in the earliest updates and then stays horizontal, whereas natural continues improving, even after 400k updates. Therefore BT does not help to avoid overfitting, it actually encourages it, which may be due \u201ceasier\u201d training examples (cf. \u00a7 SECREF15 ).\nProperties of back-translated data \nComparing the natural and artificial sources of our parallel data wrt. several linguistic and distributional properties, we observe that (see Fig. FIGREF21 - FIGREF22 ):\nartificial sources are on average shorter than natural ones: when using BT, cases where the source is shorter than the target are rarer; cases when they have the same length are more frequent.\nautomatic word alignments between artificial sources tend to be more monotonic than when using natural sources, as measured by the average Kendall INLINEFORM0 of source-target alignments BIBREF22 : for French-English the respective numbers are 0.048 (natural) and 0.018 (artificial); for German-English 0.068 and 0.053. Using more monotonic sentence pairs turns out to be a facilitating factor for NMT, as also noted by BIBREF20 .\nsyntactically, artificial sources are simpler than real data; We observe significant differences in the distributions of tree depths.\ndistributionally, plain word occurrences in artificial sources are more concentrated; this also translates into both a slower increase of the number of types wrt. the number of sentences and a smaller number of rare events.\nThe intuition is that properties (i) and (ii) should help translation as compared to natural source, while property (iv) should be detrimental. We checked (ii) by building systems with only 10M words from the natural parallel data selecting these data either randomly or based on the regularity of their word alignments. Results in Table TABREF23 show that the latter is much preferable for the overall performance. This might explain that the mostly monotonic BT from Moses are almost as good as the fluid BT from NMT and that both boost the baseline.\nStupid Back-Translation \nWe now analyze the effect of using much simpler data generation schemes, which do not require the availability of a backward translation engine.\nSetups\nWe use the following cheap ways to generate pseudo-source texts:\ncopy: in this setting, the source side is a mere copy of the target-side data. Since the source vocabulary of the NMT is fixed, copying the target sentences can cause the occurrence of OOVs. To avoid this situation, BIBREF24 decompose the target words into source-side units to make the copy look like source sentences. Each OOV found in the copy is split into smaller units until all the resulting chunks are in the source vocabulary.\ncopy-marked: another way to integrate copies without having to deal with OOVs is to augment the source vocabulary with a copy of the target vocabulary. In this setup, BIBREF25 ensure that both vocabularies never overlap by marking the target word copies with a special language identifier. Therefore the English word resume cannot be confused with the homographic French word, which is marked @fr@resume.\ncopy-dummies: instead of using actual copies, we replace each word with \u201cdummy\u201d tokens. We use this unrealistic setup to observe the training over noisy and hardly informative source sentences.\nWe then use the procedures described in \u00a7 SECREF4 , except that the pseudo-source embeddings in the copy-marked setup are pretrained for three epochs on the in-domain data, while all remaining parameters are frozen. This prevents random parameters from hurting the already trained model.\nCopy+marking+noise is not so stupid\nWe observe that the copy setup has only a small impact on the English-French system, for which the baseline is already strong. This is less true for English-German where simple copies yield a significant improvement. Performance drops for both language pairs in the copy-dummies setup.\nWe achieve our best gains with the copy-marked setup, which is the best way to use a copy of the target (although the performance on the out-of-domain tests is at most the same as the baseline). Such gains may look surprising, since the NMT model does not need to learn to translate but only to copy the source. This is indeed what happens: to confirm this, we built a fake test set having identical source and target side (in French). The average cross-entropy for this test set is 0.33, very close to 0, to be compared with an average cost of 58.52 when we process an actual source (in English). This means that the model has learned to copy words from source to target with no difficulty, even for sentences not seen in training. A follow-up question is whether training a copying task instead of a translation task limits the improvement: would the NMT learn better if the task was harder? To measure this, we introduce noise in the target sentences copied onto the source, following the procedure of BIBREF26 : it deletes random words and performs a small random permutation of the remaining words. Results (+ Source noise) show no difference for the French in-domain test sets, but bring the out-of-domain score to the level of the baseline. Finally, we observe a significant improvement on German in-domain test sets, compared to the baseline (about +1.5 BLEU). This last setup is even almost as good as the backtrans-nmt condition (see \u00a7 SECREF8 ) for German. This shows that learning to reorder and predict missing words can more effectively serve our purposes than simply learning to copy.\nTowards more natural pseudo-sources \nIntegrating monolingual data into NMT can be as easy as copying the target into the source, which already gives some improvement; adding noise makes things even better. We now study ways to make pseudo-sources look more like natural data, using the framework of Generative Adversarial Networks (GANs) BIBREF27 , an idea borrowed from BIBREF26 .\nGAN setups\nIn our setups, we use a marked target copy, viewed as a fake source, which a generator encodes so as to fool a discriminator trained to distinguish a fake from a natural source. Our architecture contains two distinct encoders, one for the natural source and one for the pseudo-source. The latter acts as the generator ( INLINEFORM0 ) in the GAN framework, computing a representation of the pseudo-source that is then input to a discriminator ( INLINEFORM1 ), which has to sort natural from artificial encodings. INLINEFORM2 assigns a probability of a sentence being natural.\nDuring training, the cost of the discriminator is computed over two batches, one with natural (out-of-domain) sentences INLINEFORM0 and one with (in-domain) pseudo-sentences INLINEFORM1 . The discriminator is a bidirectional-Recurrent Neural Network (RNN) of dimension 1024. Averaged states are passed to a single feed-forward layer, to which a sigmoid is applied. It inputs encodings of natural ( INLINEFORM2 ) and pseudo-sentences ( INLINEFORM3 ) and is trained to optimize:\nINLINEFORM0\nINLINEFORM0 's parameters are updated to maximally fool INLINEFORM1 , thus the loss INLINEFORM2 : INLINEFORM3\nFinally, we keep the usual MT objective. ( INLINEFORM0 is a real or pseudo-sentence): INLINEFORM1\nWe thus need to train three sets of parameters: INLINEFORM0 and INLINEFORM1 (MT parameters), with INLINEFORM2 . The pseudo-source encoder and embeddings are updated wrt. both INLINEFORM3 and INLINEFORM4 . Following BIBREF28 , INLINEFORM5 is updated only when INLINEFORM6 's accuracy exceeds INLINEFORM7 . On the other hand, INLINEFORM8 is not updated when its accuracy exceeds INLINEFORM9 . At each update, two batches are generated for each type of data, which are encoded with the real or pseudo-encoder. The encoder outputs serve to compute INLINEFORM10 and INLINEFORM11 . Finally, the pseudo-source is encoded again (once INLINEFORM12 is updated), both encoders are plugged into the translation model and the MT cost is back-propagated down to the real and pseudo-word embeddings. Pseudo-encoder and discriminator parameters are pre-trained for 10k updates. At test time, the pseudo-encoder is ignored and inference is run as usual.\nGANs can help\nResults are in Table TABREF32 , assuming the same fine-tuning procedure as above. On top of the copy-marked setup, our GANs do not provide any improvement in both language pairs, with the exception of a small improvement for English-French on the out-of-domain test, which we understand as a sign that the model is more robust to domain variations, just like when adding pseudo-source noise. When combined with noise, the French model yields the best performance we could obtain with stupid BT on the in-domain tests, at least in terms of BLEU and BEER. On the News domain, we remain close to the baseline level, with slight improvements in German.\nA first observation is that this method brings stupid BT models closer to conventional BT, at a greatly reduced computational cost. While French still remains 0.4 to 1.0 BLEU below very good backtranslation, both approaches are in the same ballpark for German - may be because BTs are better for the former system than for the latter.\nFinally note that the GAN architecture has two differences with basic copy-marked: (a) a distinct encoder for real and pseudo-sentence; (b) a different training regime for these encoders. To sort out the effects of (a) and (b), we reproduce the GAN setup with BT sentences, instead of copies. Using a separate encoder for the pseudo-source in the backtrans-nmt setup can be detrimental to performance (see Table TABREF32 ): translation degrades in French for all metrics. Adding GANs on top of the pseudo-encoder was not able to make up for the degradation observed in French, but allowed the German system to slightly outperform backtrans-nmt. Even though this setup is unrealistic and overly costly, it shows that GANs are actually helping even good systems.\nUsing Target Language Models \nIn this section, we compare the previous methods with the use of a target side Language Model (LM). Several proposals exist in the literature to integrate LMs in NMT: for instance, BIBREF3 strengthen the decoder by integrating an extra, source independent, RNN layer in a conventional NMT architecture. Training is performed either with parallel, or monolingual data. In the latter case, word prediction only relies on the source independent part of the network.\nLM Setup\nWe have followed BIBREF29 and reimplemented their deep-fusion technique. It requires to first independently learn a RNN-LM on the in-domain target data with a cross-entropy objective; then to train the optimal combination of the translation and the language models by adding the hidden state of the RNN-LM as an additional input to the softmax layer of the decoder. Our RNN-LMs are trained using dl4mt with the target side of the parallel data and the Europarl corpus (about 6M sentences for both French and German), using a one-layer GRU with the same dimension as the MT decoder (1024).\nLM Results\nResults are in Table TABREF33 . They show that deep-fusion hardly improves the Europarl results, while we obtain about +0.6 BLEU over the baseline on newstest-2014 for both languages. deep-fusion differs from stupid BT in that the model is not directly optimized on the in-domain data, but uses the LM trained on Europarl to maximize the likelihood of the out-of-domain training data. Therefore, no specific improvement is to be expected in terms of domain adaptation, and the performance increases in the more general domain. Combining deep-fusion and copy-marked + noise + GANs brings slight improvements on the German in-domain test sets, and performance out of the domain remains near the baseline level.\nRe-analyzing the effects of BT \nAs a follow up of previous discussions, we analyze the effect of BT on the internals of the network. Arguably, using a copy of the target sentence instead of a natural source should not be helpful for the encoder, but is it also the case with a strong BT? What are the effects on the attention model?\nParameter freezing protocol\nTo investigate these questions, we run the same fine-tuning using the copy-marked, backtrans-nmt and backtrans-nmt setups. Note that except for the last one, all training scenarios have access to same target training data. We intend to see whether the overall performance of the NMT system degrades when we selectively freeze certain sets of parameters, meaning that they are not updated during fine-tuning.\nResults\nBLEU scores are in Table TABREF39 . The backtrans-nmt setup is hardly impacted by selective updates: updating the only decoder leads to a degradation of at most 0.2 BLEU. For copy-marked, we were not able to freeze the source embeddings, since these are initialized when fine-tuning begins and therefore need to be trained. We observe that freezing the encoder and/or the attention parameters has no impact on the English-German system, whereas it slightly degrades the English-French one. This suggests that using artificial sources, even of the poorest quality, has a positive impact on all the components of the network, which makes another big difference with the LM integration scenario.\nThe largest degradation is for natural, where the model is prevented from learning from informative source sentences, which leads to a decrease of 0.4 to over 1.0 BLEU. We assume from these experiments that BT impacts most of all the decoder, and learning to encode a pseudo-source, be it a copy or an actual back-translation, only marginally helps to significantly improve the quality. Finally, in the fwdtrans-nmt setup, freezing the decoder does not seem to harm learning with a natural source.\nRelated work\nThe literature devoted to the use of monolingual data is large, and quickly expanding. We already alluded to several possible ways to use such data: using back- or forward-translation or using a target language model. The former approach is mostly documented in BIBREF2 , and recently analyzed in BIBREF19 , which focus on fully artificial settings as well as pivot-based artificial data; and BIBREF4 , which studies the effects of increasing the size of BT data. The studies of BIBREF20 , BIBREF19 also consider forward translation and BIBREF21 expand these results to domain adaptation scenarios. Our results are complementary to these earlier studies.\nAs shown above, many alternatives to BT exist. The most obvious is to use target LMs BIBREF3 , BIBREF29 , as we have also done here; but attempts to improve the encoder using multi-task learning also exist BIBREF30 .\nThis investigation is also related to recent attempts to consider supplementary data with a valid target side, such as multi-lingual NMT BIBREF31 , where source texts in several languages are fed in the same encoder-decoder architecture, with partial sharing of the layers. This is another realistic scenario where additional resources can be used to selectively improve parts of the model.\nRound trip training is another important source of inspiration, as it can be viewed as a way to use BT to perform semi-unsupervised BIBREF32 or unsupervised BIBREF33 training of NMT. The most convincing attempt to date along these lines has been proposed by BIBREF26 , who propose to use GANs to mitigate the difference between artificial and natural data.\nConclusion \nIn this paper, we have analyzed various ways to integrate monolingual data in an NMT framework, focusing on their impact on quality and domain adaptation. While confirming the effectiveness of BT, our study also proposed significantly cheaper ways to improve the baseline performance, using a slightly modified copy of the target, instead of its full BT. When no high quality BT is available, using GANs to make the pseudo-source sentences closer to natural source sentences is an efficient solution for domain adaptation.\nTo recap our answers to our initial questions: the quality of BT actually matters for NMT (cf. \u00a7 SECREF8 ) and it seems that, even though artificial source are lexically less diverse and syntactically complex than real sentence, their monotonicity is a facilitating factor. We have studied cheaper alternatives and found out that copies of the target, if properly noised (\u00a7 SECREF4 ), and even better, if used with GANs, could be almost as good as low quality BTs (\u00a7 SECREF5 ): BT is only worth its cost when good BT can be generated. Finally, BT seems preferable to integrating external LM - at least in our data condition (\u00a7 SECREF6 ). Further experiments with larger LMs are needed to confirm this observation, and also to evaluate the complementarity of both strategies. More work is needed to better understand the impact of BT on subparts of the network (\u00a7 SECREF7 ).\nIn future work, we plan to investigate other cheap ways to generate artificial data. The experimental setup we proposed may also benefit from a refining of the data selection strategies to focus on the most useful monolingual sentences.", "source": "scientific_qa", "evaluation": "f1"}
{"instructions": ["What metrics are used for evaluation?", "What natural language(s) are the recipes written in?", "What were their results on the new dataset?", "What are the baseline models?", "How did they obtain the interactions?", "Where do they get the recipes from?"], "outputs": ["BLEU-1/4 and ROUGE-L, likelihood of generated recipes using identical input specifications but conditioned on ten different user profiles, user matching accuracy (UMA), Mean Reciprocal Rank (MRR), neural scoring model from BIBREF33 to measure recipe-level coherence", "English", "average recipe-level coherence scores of 1.78-1.82, human evaluators preferred personalized model outputs to baseline 63% of the time", "name-based Nearest-Neighbor model (NN), Encoder-Decoder baseline with ingredient attention (Enc-Dec)", "from Food.com", "from Food.com"], "input": "Introduction\nIn the kitchen, we increasingly rely on instructions from cooking websites: recipes. A cook with a predilection for Asian cuisine may wish to prepare chicken curry, but may not know all necessary ingredients apart from a few basics. These users with limited knowledge cannot rely on existing recipe generation approaches that focus on creating coherent recipes given all ingredients and a recipe name BIBREF0. Such models do not address issues of personal preference (e.g. culinary tastes, garnish choices) and incomplete recipe details. We propose to approach both problems via personalized generation of plausible, user-specific recipes using user preferences extracted from previously consumed recipes.\nOur work combines two important tasks from natural language processing and recommender systems: data-to-text generation BIBREF1 and personalized recommendation BIBREF2. Our model takes as user input the name of a specific dish, a few key ingredients, and a calorie level. We pass these loose input specifications to an encoder-decoder framework and attend on user profiles\u2014learned latent representations of recipes previously consumed by a user\u2014to generate a recipe personalized to the user's tastes. We fuse these `user-aware' representations with decoder output in an attention fusion layer to jointly determine text generation. Quantitative (perplexity, user-ranking) and qualitative analysis on user-aware model outputs confirm that personalization indeed assists in generating plausible recipes from incomplete ingredients.\nWhile personalized text generation has seen success in conveying user writing styles in the product review BIBREF3, BIBREF4 and dialogue BIBREF5 spaces, we are the first to consider it for the problem of recipe generation, where output quality is heavily dependent on the content of the instructions\u2014such as ingredients and cooking techniques.\nTo summarize, our main contributions are as follows:\nWe explore a new task of generating plausible and personalized recipes from incomplete input specifications by leveraging historical user preferences;\nWe release a new dataset of 180K+ recipes and 700K+ user reviews for this task;\nWe introduce new evaluation strategies for generation quality in instructional texts, centering on quantitative measures of coherence. We also show qualitatively and quantitatively that personalized models generate high-quality and specific recipes that align with historical user preferences.\nRelated Work\nLarge-scale transformer-based language models have shown surprising expressivity and fluency in creative and conditional long-text generation BIBREF6, BIBREF7. Recent works have proposed hierarchical methods that condition on narrative frameworks to generate internally consistent long texts BIBREF8, BIBREF9, BIBREF10. Here, we generate procedurally structured recipes instead of free-form narratives.\nRecipe generation belongs to the field of data-to-text natural language generation BIBREF1, which sees other applications in automated journalism BIBREF11, question-answering BIBREF12, and abstractive summarization BIBREF13, among others. BIBREF14, BIBREF15 model recipes as a structured collection of ingredient entities acted upon by cooking actions. BIBREF0 imposes a `checklist' attention constraint emphasizing hitherto unused ingredients during generation. BIBREF16 attend over explicit ingredient references in the prior recipe step. Similar hierarchical approaches that infer a full ingredient list to constrain generation will not help personalize recipes, and would be infeasible in our setting due to the potentially unconstrained number of ingredients (from a space of 10K+) in a recipe. We instead learn historical preferences to guide full recipe generation.\nA recent line of work has explored user- and item-dependent aspect-aware review generation BIBREF3, BIBREF4. This work is related to ours in that it combines contextual language generation with personalization. Here, we attend over historical user preferences from previously consumed recipes to generate recipe content, rather than writing styles.\nApproach\nOur model's input specification consists of: the recipe name as a sequence of tokens, a partial list of ingredients, and a caloric level (high, medium, low). It outputs the recipe instructions as a token sequence: $\\mathcal {W}_r=\\lbrace w_{r,0}, \\dots , w_{r,T}\\rbrace $ for a recipe $r$ of length $T$. To personalize output, we use historical recipe interactions of a user $u \\in \\mathcal {U}$.\nEncoder: Our encoder has three embedding layers: vocabulary embedding $\\mathcal {V}$, ingredient embedding $\\mathcal {I}$, and caloric-level embedding $\\mathcal {C}$. Each token in the (length $L_n$) recipe name is embedded via $\\mathcal {V}$; the embedded token sequence is passed to a two-layered bidirectional GRU (BiGRU) BIBREF17, which outputs hidden states for names $\\lbrace \\mathbf {n}_{\\text{enc},j} \\in \\mathbb {R}^{2d_h}\\rbrace $, with hidden size $d_h$. Similarly each of the $L_i$ input ingredients is embedded via $\\mathcal {I}$, and the embedded ingredient sequence is passed to another two-layered BiGRU to output ingredient hidden states as $\\lbrace \\mathbf {i}_{\\text{enc},j} \\in \\mathbb {R}^{2d_h}\\rbrace $. The caloric level is embedded via $\\mathcal {C}$ and passed through a projection layer with weights $W_c$ to generate calorie hidden representation $\\mathbf {c}_{\\text{enc}} \\in \\mathbb {R}^{2d_h}$.\nIngredient Attention: We apply attention BIBREF18 over the encoded ingredients to use encoder outputs at each decoding time step. We define an attention-score function $\\alpha $ with key $K$ and query $Q$:\nwith trainable weights $W_{\\alpha }$, bias $\\mathbf {b}_{\\alpha }$, and normalization term $Z$. At decoding time $t$, we calculate the ingredient context $\\mathbf {a}_{t}^{i} \\in \\mathbb {R}^{d_h}$ as:\nDecoder: The decoder is a two-layer GRU with hidden state $h_t$ conditioned on previous hidden state $h_{t-1}$ and input token $w_{r, t}$ from the original recipe text. We project the concatenated encoder outputs as the initial decoder hidden state:\nTo bias generation toward user preferences, we attend over a user's previously reviewed recipes to jointly determine the final output token distribution. We consider two different schemes to model preferences from user histories: (1) recipe interactions, and (2) techniques seen therein (defined in data). BIBREF19, BIBREF20, BIBREF21 explore similar schemes for personalized recommendation.\nPrior Recipe Attention: We obtain the set of prior recipes for a user $u$: $R^+_u$, where each recipe can be represented by an embedding from a recipe embedding layer $\\mathcal {R}$ or an average of the name tokens embedded by $\\mathcal {V}$. We attend over the $k$-most recent prior recipes, $R^{k+}_u$, to account for temporal drift of user preferences BIBREF22. These embeddings are used in the `Prior Recipe' and `Prior Name' models, respectively.\nGiven a recipe representation $\\mathbf {r} \\in \\mathbb {R}^{d_r}$ (where $d_r$ is recipe- or vocabulary-embedding size depending on the recipe representation) the prior recipe attention context $\\mathbf {a}_{t}^{r_u}$ is calculated as\nPrior Technique Attention: We calculate prior technique preference (used in the `Prior Tech` model) by normalizing co-occurrence between users and techniques seen in $R^+_u$, to obtain a preference vector $\\rho _{u}$. Each technique $x$ is embedded via a technique embedding layer $\\mathcal {X}$ to $\\mathbf {x}\\in \\mathbb {R}^{d_x}$. Prior technique attention is calculated as\nwhere, inspired by copy mechanisms BIBREF23, BIBREF24, we add $\\rho _{u,x}$ for technique $x$ to emphasize the attention by the user's prior technique preference.\nAttention Fusion Layer: We fuse all contexts calculated at time $t$, concatenating them with decoder GRU output and previous token embedding:\nWe then calculate the token probability:\nand maximize the log-likelihood of the generated sequence conditioned on input specifications and user preferences. fig:ex shows a case where the Prior Name model attends strongly on previously consumed savory recipes to suggest the usage of an additional ingredient (`cilantro').\nRecipe Dataset: Food.com\nWe collect a novel dataset of 230K+ recipe texts and 1M+ user interactions (reviews) over 18 years (2000-2018) from Food.com. Here, we restrict to recipes with at least 3 steps, and at least 4 and no more than 20 ingredients. We discard users with fewer than 4 reviews, giving 180K+ recipes and 700K+ reviews, with splits as in tab:recipeixnstats.\nOur model must learn to generate from a diverse recipe space: in our training data, the average recipe length is 117 tokens with a maximum of 256. There are 13K unique ingredients across all recipes. Rare words dominate the vocabulary: 95% of words appear $<$100 times, accounting for only 1.65% of all word usage. As such, we perform Byte-Pair Encoding (BPE) tokenization BIBREF25, BIBREF26, giving a training vocabulary of 15K tokens across 19M total mentions. User profiles are similarly diverse: 50% of users have consumed $\\le $6 recipes, while 10% of users have consumed $>$45 recipes.\nWe order reviews by timestamp, keeping the most recent review for each user as the test set, the second most recent for validation, and the remainder for training (sequential leave-one-out evaluation BIBREF27). We evaluate only on recipes not in the training set.\nWe manually construct a list of 58 cooking techniques from 384 cooking actions collected by BIBREF15; the most common techniques (bake, combine, pour, boil) account for 36.5% of technique mentions. We approximate technique adherence via string match between the recipe text and technique list.\nExperiments and Results\nFor training and evaluation, we provide our model with the first 3-5 ingredients listed in each recipe. We decode recipe text via top-$k$ sampling BIBREF7, finding $k=3$ to produce satisfactory results. We use a hidden size $d_h=256$ for both the encoder and decoder. Embedding dimensions for vocabulary, ingredient, recipe, techniques, and caloric level are 300, 10, 50, 50, and 5 (respectively). For prior recipe attention, we set $k=20$, the 80th %-ile for the number of user interactions. We use the Adam optimizer BIBREF28 with a learning rate of $10^{-3}$, annealed with a decay rate of 0.9 BIBREF29. We also use teacher-forcing BIBREF30 in all training epochs.\nIn this work, we investigate how leveraging historical user preferences can improve generation quality over strong baselines in our setting. We compare our personalized models against two baselines. The first is a name-based Nearest-Neighbor model (NN). We initially adapted the Neural Checklist Model of BIBREF0 as a baseline; however, we ultimately use a simple Encoder-Decoder baseline with ingredient attention (Enc-Dec), which provides comparable performance and lower complexity. All personalized models outperform baseline in BPE perplexity (tab:metricsontest) with Prior Name performing the best. While our models exhibit comparable performance to baseline in BLEU-1/4 and ROUGE-L, we generate more diverse (Distinct-1/2: percentage of distinct unigrams and bigrams) and acceptable recipes. BLEU and ROUGE are not the most appropriate metrics for generation quality. A `correct' recipe can be written in many ways with the same main entities (ingredients). As BLEU-1/4 capture structural information via n-gram matching, they are not correlated with subjective recipe quality. This mirrors observations from BIBREF31, BIBREF8.\nWe observe that personalized models make more diverse recipes than baseline. They thus perform better in BLEU-1 with more key entities (ingredient mentions) present, but worse in BLEU-4, as these recipes are written in a personalized way and deviate from gold on the phrasal level. Similarly, the `Prior Name' model generates more unigram-diverse recipes than other personalized models and obtains a correspondingly lower BLEU-1 score.\nQualitative Analysis: We present sample outputs for a cocktail recipe in tab:samplerecipes, and additional recipes in the appendix. Generation quality progressively improves from generic baseline output to a blended cocktail produced by our best performing model. Models attending over prior recipes explicitly reference ingredients. The Prior Name model further suggests the addition of lemon and mint, which are reasonably associated with previously consumed recipes like coconut mousse and pork skewers.\nPersonalization: To measure personalization, we evaluate how closely the generated text corresponds to a particular user profile. We compute the likelihood of generated recipes using identical input specifications but conditioned on ten different user profiles\u2014one `gold' user who consumed the original recipe, and nine randomly generated user profiles. Following BIBREF8, we expect the highest likelihood for the recipe conditioned on the gold user. We measure user matching accuracy (UMA)\u2014the proportion where the gold user is ranked highest\u2014and Mean Reciprocal Rank (MRR) BIBREF32 of the gold user. All personalized models beat baselines in both metrics, showing our models personalize generated recipes to the given user profiles. The Prior Name model achieves the best UMA and MRR by a large margin, revealing that prior recipe names are strong signals for personalization. Moreover, the addition of attention mechanisms to capture these signals improves language modeling performance over a strong non-personalized baseline.\nRecipe Level Coherence: A plausible recipe should possess a coherent step order, and we evaluate this via a metric for recipe-level coherence. We use the neural scoring model from BIBREF33 to measure recipe-level coherence for each generated recipe. Each recipe step is encoded by BERT BIBREF34. Our scoring model is a GRU network that learns the overall recipe step ordering structure by minimizing the cosine similarity of recipe step hidden representations presented in the correct and reverse orders. Once pretrained, our scorer calculates the similarity of a generated recipe to the forward and backwards ordering of its corresponding gold label, giving a score equal to the difference between the former and latter. A higher score indicates better step ordering (with a maximum score of 2). tab:coherencemetrics shows that our personalized models achieve average recipe-level coherence scores of 1.78-1.82, surpassing the baseline at 1.77.\nRecipe Step Entailment: Local coherence is also crucial to a user following a recipe: it is crucial that subsequent steps are logically consistent with prior ones. We model local coherence as an entailment task: predicting the likelihood that a recipe step follows the preceding. We sample several consecutive (positive) and non-consecutive (negative) pairs of steps from each recipe. We train a BERT BIBREF34 model to predict the entailment score of a pair of steps separated by a [SEP] token, using the final representation of the [CLS] token. The step entailment score is computed as the average of scores for each set of consecutive steps in each recipe, averaged over every generated recipe for a model, as shown in tab:coherencemetrics.\nHuman Evaluation: We presented 310 pairs of recipes for pairwise comparison BIBREF8 (details in appendix) between baseline and each personalized model, with results shown in tab:metricsontest. On average, human evaluators preferred personalized model outputs to baseline 63% of the time, confirming that personalized attention improves the semantic plausibility of generated recipes. We also performed a small-scale human coherence survey over 90 recipes, in which 60% of users found recipes generated by personalized models to be more coherent and preferable to those generated by baseline models.\nConclusion\nIn this paper, we propose a novel task: to generate personalized recipes from incomplete input specifications and user histories. On a large novel dataset of 180K recipes and 700K reviews, we show that our personalized generative models can generate plausible, personalized, and coherent recipes preferred by human evaluators for consumption. We also introduce a set of automatic coherence measures for instructional texts as well as personalization metrics to support our claims. Our future work includes generating structured representations of recipes to handle ingredient properties, as well as accounting for references to collections of ingredients (e.g. \u201cdry mix\").\nAcknowledgements. This work is partly supported by NSF #1750063. We thank all reviewers for their constructive suggestions, as well as Rei M., Sujoy P., Alicia L., Eric H., Tim S., Kathy C., Allen C., and Micah I. for their feedback.\nAppendix ::: Food.com: Dataset Details\nOur raw data consists of 270K recipes and 1.4M user-recipe interactions (reviews) scraped from Food.com, covering a period of 18 years (January 2000 to December 2018). See tab:int-stats for dataset summary statistics, and tab:samplegk for sample information about one user-recipe interaction and the recipe involved.\nAppendix ::: Generated Examples\nSee tab:samplechx for a sample recipe for chicken chili and tab:samplewaffle for a sample recipe for sweet waffles.\nHuman Evaluation\nWe prepared a set of 15 pairwise comparisons per evaluation session, and collected 930 pairwise evaluations (310 per personalized model) over 62 sessions. For each pair, users were given a partial recipe specification (name and 3-5 key ingredients), as well as two generated recipes labeled `A' and `B'. One recipe is generated from our baseline encoder-decoder model and one recipe is generated by one of our three personalized models (Prior Tech, Prior Name, Prior Recipe). The order of recipe presentation (A/B) is randomly selected for each question. A screenshot of the user evaluation interface is given in fig:exeval. We ask the user to indicate which recipe they find more coherent, and which recipe best accomplishes the goal indicated by the recipe name. A screenshot of this survey interface is given in fig:exeval2.", "source": "scientific_qa", "evaluation": "f1"}
{"instructions": ["what are their results on the constructed dataset?", "what evaluation metrics are reported?", "what civil field is the dataset about?", "what are the state-of-the-art models?", "what is the size of the real-world civil case dataset?", "what datasets are used in the experiment?"], "outputs": ["AutoJudge consistently and significantly outperforms all the baselines, RC models achieve better performance than most text classification models (excluding GRU+Attention), Comparing with conventional RC models, AutoJudge achieves significant improvement", "precision, recall, F1 and accuracy", "divorce", "SVM , CNN , GRU , CNN/GRU+law, r-net , AoA ", "100 000 documents", "build a new one, collect INLINEFORM0 cases from China Judgments Online"], "input": "Introduction\nAutomatic judgment prediction is to train a machine judge to determine whether a certain plea in a given civil case would be supported or rejected. In countries with civil law system, e.g. mainland China, such process should be done with reference to related law articles and the fact description, as is performed by a human judge. The intuition comes from the fact that under civil law system, law articles act as principles for juridical judgments. Such techniques would have a wide range of promising applications. On the one hand, legal consulting systems could provide better access to high-quality legal resources in a low-cost way to legal outsiders, who suffer from the complicated terminologies. On the other hand, machine judge assistants for professionals would help improve the efficiency of the judicial system. Besides, automated judgment system can help in improving juridical equality and transparency. From another perspective, there are currently 7 times much more civil cases than criminal cases in mainland China, with annual rates of increase of INLINEFORM0 and INLINEFORM1 respectively, making judgment prediction in civil cases a promising application BIBREF0 .\nPrevious works BIBREF1 , BIBREF2 , BIBREF3 , BIBREF4 formalize judgment prediction as the text classification task, regarding either charge names or binary judgments, i.e., support or reject, as the target classes. These works focus on the situation where only one result is expected, e.g., the US Supreme Court's decisions BIBREF2 , and the charge name prediction for criminal cases BIBREF3 . Despite these recent efforts and their progress, automatic judgment prediction in civil law system is still confronted with two main challenges:\nOne-to-Many Relation between Case and Plea. Every single civil case may contain multiple pleas and the result of each plea is co-determined by related law articles and specific aspects of the involved case. For example, in divorce proceedings, judgment of alienation of mutual affection is the key factor for granting divorce but custody of children depends on which side can provide better an environment for children's growth as well as parents' financial condition. Here, different pleas are independent.\nHeterogeneity of Input Triple. Inputs to a judgment prediction system consist of three heterogeneous yet complementary parts, i.e., fact description, plaintiff's plea, and related law articles. Concatenating them together and treating them simply as a sequence of words as in previous works BIBREF2 , BIBREF1 would cause a great loss of information. This is the same in question-answering where the dual inputs, i.e., query and passage, should be modeled separately.\nDespite the introduction of the neural networks that can learn better semantic representations of input text, it remains unsolved to incorporate proper mechanisms to integrate the complementary triple of pleas, fact descriptions, and law articles together.\nInspired by recent advances in question answering (QA) based reading comprehension (RC) BIBREF5 , BIBREF6 , BIBREF7 , BIBREF8 , we propose the Legal Reading Comprehension (LRC) framework for automatic judgment prediction. LRC incorporates the reading mechanism for better modeling of the complementary inputs above-mentioned, as is done by human judges when referring to legal materials in search of supporting law articles. Reading mechanism, by simulating how human connects and integrates multiple text, has proven an effective module in RC tasks. We argue that applying the reading mechanism in a proper way among the triplets can obtain a better understanding and more informative representation of the original text, and further improve performance . To instantiate the framework, we propose an end-to-end neural network model named AutoJudge.\nFor experiments, we train and evaluate our models in the civil law system of mainland China. We collect and construct a large-scale real-world data set of INLINEFORM0 case documents that the Supreme People's Court of People's Republic of China has made publicly available. Fact description, pleas, and results can be extracted easily from these case documents with regular expressions, since the original documents have special typographical characteristics indicating the discourse structure. We also take into account law articles and their corresponding juridical interpretations. We also implement and evaluate previous methods on our dataset, which prove to be strong baselines.\nOur experiment results show significant improvements over previous methods. Further experiments demonstrate that our model also achieves considerable improvement over other off-the-shelf state-of-the-art models under classification and question answering framework respectively. Ablation tests carried out by taking off some components of our model further prove its robustness and effectiveness.\nTo sum up, our contributions are as follows:\n(1) We introduce reading mechanism and re-formalize judgment prediction as Legal Reading Comprehension to better model the complementary inputs.\n(2) We construct a real-world dataset for experiments, and plan to publish it for further research.\n(3) Besides baselines from previous works, we also carry out comprehensive experiments comparing different existing deep neural network methods on our dataset. Supported by these experiments, improvements achieved by LRC prove to be robust.\nJudgment Prediction\nAutomatic judgment prediction has been studied for decades. At the very first stage of judgment prediction studies, researchers focus on mathematical and statistical analysis of existing cases, without any conclusions or methodologies on how to predict them BIBREF9 , BIBREF10 , BIBREF11 , BIBREF12 , BIBREF13 , BIBREF14 . Recent attempts consider judgment prediction under the text classification framework. Most of these works extract efficient features from text (e.g., N-grams) BIBREF15 , BIBREF4 , BIBREF1 , BIBREF16 , BIBREF17 or case profiles (e.g., dates, terms, locations and types) BIBREF2 . All these methods require a large amount of human effort to design features or annotate cases. Besides, they also suffer from generalization issue when applied to other scenarios.\nMotivated by the successful application of deep neural networks, Luo et al. BIBREF3 introduce an attention-based neural model to predict charges of criminal cases, and verify the effectiveness of taking law articles into consideration. Nevertheless, they still fall into the text classification framework and lack the ability to handle multiple inputs with more complicated structures.\nText Classification\nAs the basis of previous judgment prediction works, typical text classification task takes a single text content as input and predicts the category it belongs to. Recent works usually employ neural networks to model the internal structure of a single input BIBREF18 , BIBREF19 , BIBREF20 , BIBREF21 .\nThere also exists another thread of text classification called entailment prediction. Methods proposed in BIBREF22 , BIBREF23 are intended for complementary inputs, but the mechanisms can be considered as a simplified version of reading comprehension.\nReading Comprehension\nReading comprehension is a relevant task to model heterogeneous and complementary inputs, where an answer is predicted given two channels of inputs, i.e. a textual passage and a query. Considerable progress has been made BIBREF6 , BIBREF24 , BIBREF5 . These models employ various attention mechanism to model the interaction between passage and query. Inspired by the advantage of reading comprehension models on modeling multiple inputs, we apply this idea into the legal area and propose legal reading comprehension for judgment prediction.\nConventional Reading Comprehension\nConventional reading comprehension BIBREF25 , BIBREF26 , BIBREF7 , BIBREF8 usually considers reading comprehension as predicting the answer given a passage and a query, where the answer could be a single word, a text span of the original passage, chosen from answer candidates, or generated by human annotators.\nGenerally, an instance in RC is represented as a triple INLINEFORM0 , where INLINEFORM1 , INLINEFORM2 and INLINEFORM3 correspond to INLINEFORM4 , INLINEFORM5 and INLINEFORM6 respectively. Given a triple INLINEFORM7 , RC takes the pair INLINEFORM8 as the input and employs attention-based neural models to construct an efficient representation. Afterwards, the representation is fed into the output layer to select or generate an INLINEFORM9 .\nLegal Reading Comprehension\nExisting works usually formalize judgment prediction as a text classification task and focus on extracting well-designed features of specific cases. Such simplification ignores that the judgment of a case is determined by its fact description and multiple pleas. Moreover, the final judgment should act up to the legal provisions, especially in civil law systems. Therefore, how to integrate the information (i.e., fact descriptions, pleas, and law articles) in a reasonable way is critical for judgment prediction.\nInspired by the successful application of RC, we propose a framework of Legal Reading Comprehension(LRC) for judgment prediction in the legal area. As illustrated in Fig. FIGREF1 , for each plea in a given case, the prediction of judgment result is made based the fact description and the potentially relevant law articles.\nIn a nutshell, LRC can be formalized as the following quadruplet task: DISPLAYFORM0\nwhere INLINEFORM0 is the fact description, INLINEFORM1 is the plea, INLINEFORM2 is the law articles and INLINEFORM3 is the result. Given INLINEFORM4 , LRC aims to predict the judgment result as DISPLAYFORM0\nThe probability is calculated with respect to the interaction among the triple INLINEFORM0 , which will draw on the experience of the interaction between INLINEFORM1 pairs in RC.\nTo summarize, LRC is innovative in the following aspects:\n(1) While previous works fit the problem into text classification framework, LRC re-formalizes the way to approach such problems. This new framework provides the ability to deal with the heterogeneity of the complementary inputs.\n(2) Rather than employing conventional RC models to handle pair-wise text information in the legal area, LRC takes the critical law articles into consideration and models the facts, pleas, and law articles jointly for judgment prediction, which is more suitable to simulate the human mode of dealing with cases.\nMethods\nWe propose a novel judgment prediction model AutoJudge to instantiate the LRC framework. As shown in Fig. FIGREF6 , AutoJudge consists of three flexible modules, including a text encoder, a pair-wise attentive reader, and an output module.\nIn the following parts, we give a detailed introduction to these three modules.\nText Encoder\nAs illustrated in Fig. FIGREF6 , Text Encoder aims to encode the word sequences of inputs into continuous representation sequences.\nFormally, consider a fact description INLINEFORM0 , a plea INLINEFORM1 , and the relevant law articles INLINEFORM2 , where INLINEFORM3 denotes the INLINEFORM4 -th word in the sequence and INLINEFORM5 are the lengths of word sequences INLINEFORM6 respectively. First, we convert the words to their respective word embeddings to obtain INLINEFORM7 , INLINEFORM8 and INLINEFORM9 , where INLINEFORM10 . Afterwards, we employ bi-directional GRU BIBREF27 , BIBREF28 , BIBREF29 to produce the encoded representation INLINEFORM11 of all words as follows: DISPLAYFORM0\nNote that, we adopt different bi-directional GRUs to encode fact descriptions, pleas, and law articles respectively(denoted as INLINEFORM0 , INLINEFORM1 , and INLINEFORM2 ). With these text encoders, INLINEFORM3 , INLINEFORM4 , and INLINEFORM5 are converting into INLINEFORM6 , INLINEFORM7 , and INLINEFORM8 .\nPair-Wise Attentive Reader\nHow to model the interactions among the input text is the most important problem in reading comprehension. In AutoJudge, we employ a pair-wise attentive reader to process INLINEFORM0 and INLINEFORM1 respectively. More specifically, we propose to use pair-wise mutual attention mechanism to capture the complex semantic interaction between text pairs, as well as increasing the interpretability of AutoJudge.\nFor each input pair INLINEFORM0 or INLINEFORM1 , we employ pair-wise mutual attention to select relevant information from fact descriptions INLINEFORM2 and produce more informative representation sequences.\nAs a variant of the original attention mechanism BIBREF28 , we design the pair-wise mutual attention unit as a GRU with internal memories denoted as mGRU.\nTaking the representation sequence pair INLINEFORM0 for instance, mGRU stores the fact sequence INLINEFORM1 into its memories. For each timestamp INLINEFORM2 , it selects relevant fact information INLINEFORM3 from the memories as follows, DISPLAYFORM0\nHere, the weight INLINEFORM0 is the softmax value as DISPLAYFORM0\nNote that, INLINEFORM0 represents the relevance between INLINEFORM1 and INLINEFORM2 . It is calculated as follows, DISPLAYFORM0\nHere, INLINEFORM0 is the last hidden state in the GRU, which will be introduced in the following part. INLINEFORM1 is a weight vector, and INLINEFORM2 , INLINEFORM3 , INLINEFORM4 are attention metrics of our proposed pair-wise attention mechanism.\nWith the relevant fact information INLINEFORM0 and INLINEFORM1 , we get the INLINEFORM2 -th input of mGRU as DISPLAYFORM0\nwhere INLINEFORM0 indicates the concatenation operation.\nThen, we feed INLINEFORM0 into GRU to get more informative representation sequence INLINEFORM1 as follows, DISPLAYFORM0\nFor the input pair INLINEFORM0 , we can get INLINEFORM1 in the same way. Therefore, we omit the implementation details Here.\nSimilar structures with attention mechanism are also applied in BIBREF5 , BIBREF30 , BIBREF31 , BIBREF28 to obtain mutually aware representations in reading comprehension models, which significantly improve the performance of this task.\nOutput Layer\nUsing text encoder and pair-wise attentive reader, the initial input triple INLINEFORM0 has been converted into two sequences, i.e., INLINEFORM1 and INLINEFORM2 , where INLINEFORM3 is defined similarly to INLINEFORM4 . These sequences reserve complex semantic information about the pleas and law articles, and filter out irrelevant information in fact descriptions.\nWith these two sequences, we concatenate INLINEFORM0 and INLINEFORM1 along the sequence length dimension to generate the sequence INLINEFORM2 . Since we have employed several GRU layers to encode the sequential inputs, another recurrent layer may be redundant. Therefore, we utilize a 1-layer CNN BIBREF18 to capture the local structure and generate the representation vector for the final prediction.\nAssuming INLINEFORM0 is the predicted probability that the plea in the case sample would be supported and INLINEFORM1 is the gold standard, AutoJudge aims to minimize the cross-entropy as follows, DISPLAYFORM0\nwhere INLINEFORM0 is the number of training data. As all the calculation in our model is differentiable, we employ Adam BIBREF32 for optimization.\nExperiments\nTo evaluate the proposed LRC framework and the AutoJudge model, we carry out a series of experiments on the divorce proceedings, a typical yet complex field of civil cases. Divorce proceedings often come with several kinds of pleas, e.g. seeking divorce, custody of children, compensation, and maintenance, which focuses on different aspects and thus makes it a challenge for judgment prediction.\nDataset Construction for Evaluation\nSince none of the datasets from previous works have been published, we decide to build a new one. We randomly collect INLINEFORM0 cases from China Judgments Online, among which INLINEFORM1 cases are for training, INLINEFORM2 each for validation and testing. Among the original cases, INLINEFORM3 are granted divorce and others not. There are INLINEFORM4 valid pleas in total, with INLINEFORM5 supported and INLINEFORM6 rejected. Note that, if the divorce plea in a case is not granted, the other pleas of this case will not be considered by the judge. Case materials are all natural language sentences, with averagely INLINEFORM7 tokens per fact description and INLINEFORM8 per plea. There are 62 relevant law articles in total, each with INLINEFORM9 tokens averagely. Note that the case documents include special typographical signals, making it easy to extract labeled data with regular expression.\nWe apply some rules with legal prior to preprocess the dataset according to previous works BIBREF33 , BIBREF34 , BIBREF35 , which have proved effective in our experiments.\nName Replacement: All names in case documents are replaced with marks indicating their roles, instead of simply anonymizing them, e.g. <Plantiff>, <Defendant>, <Daughter_x> and so on. Since \u201call are equal before the law\u201d, names should make no more difference than what role they take.\nLaw Article Filtration : Since most accessible divorce proceeding documents do not contain ground-truth fine-grained articles, we use an unsupervised method instead. First, we extract all the articles from the law text with regular expression. Afterwards, we select the most relevant 10 articles according to the fact descriptions as follows. We obtain sentence representation with CBOW BIBREF36 , BIBREF37 weighted by inverse document frequency, and calculate cosine distance between cases and law articles. Word embeddings are pre-trained with Chinese Wikipedia pages. As the final step, we extract top 5 relevant articles for each sample respectively from the main marriage law articles and their interpretations, which are equally important. We manually check the extracted articles for 100 cases to ensure that the extraction quality is fairly good and acceptable.\nThe filtration process is automatic and fully unsupervised since the original documents have no ground-truth labels for fine-grained law articles, and coarse-grained law-articles only provide limited information. We also experiment with the ground-truth articles, but only a small fraction of them has fine-grained ones, and they are usually not available in real-world scenarios.\nImplementation Details\nWe employ Jieba for Chinese word segmentation and keep the top INLINEFORM0 frequent words. The word embedding size is set to 128 and the other low-frequency words are replaced with the mark <UNK>. The hidden size of GRU is set to 128 for each direction in Bi-GRU. In the pair-wise attentive reader, the hidden state is set to 256 for mGRu. In the CNN layer, filter windows are set to 1, 3, 4, and 5 with each filter containing 200 feature maps. We add a dropout layer BIBREF38 after the CNN layer with a dropout rate of INLINEFORM1 . We use Adam BIBREF32 for training and set learning rate to INLINEFORM2 , INLINEFORM3 to INLINEFORM4 , INLINEFORM5 to INLINEFORM6 , INLINEFORM7 to INLINEFORM8 , batch size to 64. We employ precision, recall, F1 and accuracy for evaluation metrics. We repeat all the experiments for 10 times, and report the average results.\nBaselines\nFor comparison, we adopt and re-implement three kinds of baselines as follows:\nWe implement an SVM with lexical features in accordance with previous works BIBREF16 , BIBREF17 , BIBREF1 , BIBREF15 , BIBREF4 and select the best feature set on the development set.\nWe implement and fine-tune a series of neural text classifiers, including attention-based method BIBREF3 and other methods we deem important. CNN BIBREF18 and GRU BIBREF27 , BIBREF21 take as input the concatenation of fact description and plea. Similarly, CNN/GRU+law refers to using the concatenation of fact description, plea and law articles as inputs.\nWe implement and train some off-the-shelf RC models, including r-net BIBREF5 and AoA BIBREF6 , which are the leading models on SQuAD leaderboard. In our initial experiments, these models take fact description as passage and plea as query. Further, Law articles are added to the fact description as a part of the reading materials, which is a simple way to consider them as well.\nResults and Analysis\nFrom Table TABREF37 , we have the following observations:\n(1) AutoJudge consistently and significantly outperforms all the baselines, including RC models and other neural text classification models, which shows the effectiveness and robustness of our model.\n(2) RC models achieve better performance than most text classification models (excluding GRU+Attention), which indicates that reading mechanism is a better way to integrate information from heterogeneous yet complementary inputs. On the contrary, simply adding law articles as a part of the reading materials makes no difference in performance. Note that, GRU+Attention employ similar attention mechanism as RC does and takes additional law articles into consideration, thus achieves comparable performance with RC models.\n(3) Comparing with conventional RC models, AutoJudge achieves significant improvement with the consideration of additional law articles. It reflects the difference between LRC and conventional RC models. We re-formalize LRC in legal area to incorporate law articles via the reading mechanism, which can enhance judgment prediction. Moreover, CNN/GRU+law decrease the performance by simply concatenating original text with law articles, while GRU+Attention/AutoJudge increase the performance by integrating law articles with attention mechanism. It shows the importance and rationality of using attention mechanism to capture the interaction between multiple inputs.\nThe experiments support our hypothesis as proposed in the Introduction part that in civil cases, it's important to model the interactions among case materials. Reading mechanism can well perform the matching among them.\nAblation Test\nAutoJudge is characterized by the incorporation of pair-wise attentive reader, law articles, and a CNN output layer, as well as some pre-processing with legal prior. We design ablation tests respectively to evaluate the effectiveness of these modules. When taken off the attention mechanism, AutoJudge degrades into a GRU on which a CNN is stacked. When taken off law articles, the CNN output layer only takes INLINEFORM0 as input. Besides, our model is tested respectively without name-replacement or unsupervised selection of law articles (i.e. passing the whole law text). As mentioned above, we system use law articles extracted with unsupervised method, so we also experiment with ground-truth law articles.\nResults are shown in Table TABREF38 . We can infer that:\n(1) The performance drops significantly after removing the attention layer or excluding the law articles, which is consistent with the comparison between AutoJudge and baselines. The result verifies that both the reading mechanism and incorporation of law articles are important and effective.\n(2) After replacing CNN with an LSTM layer, performance drops as much as INLINEFORM0 in accuracy and INLINEFORM1 in F1 score. The reason may be the redundancy of RNNs. AutoJudge has employed several GRU layers to encode text sequences. Another RNN layer may be useless to capture sequential dependencies, while CNN can catch the local structure in convolution windows.\n(3) Motivated by existing rule-based works, we conduct data pre-processing on cases, including name replacement and law article filtration. If we remove the pre-processing operations, the performance drops considerably. It demonstrates that applying the prior knowledge in legal filed would benefit the understanding of legal cases.\nIt's intuitive that the quality of the retrieved law articles would affect the final performance. As is shown in Table TABREF38 , feeding the whole law text without filtration results in worse performance. However, when we train and evaluate our model with ground truth articles, the performance is boosted by nearly INLINEFORM0 in both F1 and Acc. The performance improvement is quite limited compared to that in previous work BIBREF3 for the following reasons: (1) As mentioned above, most case documents only contain coarse-grained articles, and only a small number of them contain fine-grained ones, which has limited information in themselves. (2) Unlike in criminal cases where the application of an article indicates the corresponding crime, law articles in civil cases work as reference, and can be applied in both the cases of supports and rejects. As law articles cut both ways for the judgment result, this is one of the characteristics that distinguishes civil cases from criminal ones. We also need to remember that, the performance of INLINEFORM1 in accuracy or INLINEFORM2 in F1 score is unattainable in real-world setting for automatic prediction where ground-truth articles are not available.\nIn the area of civil cases, the understanding of the case materials and how they interact is a critical factor. The inclusion of law articles is not enough. As is shown in Table TABREF38 , compared to feeding the model with an un-selected set of law articles, taking away the reading mechanism results in greater performance drop. Therefore, the ability to read, understand and select relevant information from the complex multi-sourced case materials is necessary. It's even more important in real world since we don't have access to ground-truth law articles to make predictions.\nCase Study\nWe visualize the heat maps of attention results. As shown in Fig. FIGREF47 , deeper background color represents larger attention score. The attention score is calculated with Eq. ( EQREF15 ). We take the average of the resulting INLINEFORM0 attention matrix over the time dimension to obtain attention values for each word.\nThe visualization demonstrates that the attention mechanism can capture relevant patterns and semantics in accordance with different pleas in different cases.\nAs for the failed samples, the most common reason comes from the anonymity issue, which is also shown in Fig. FIGREF47 . As mentioned above, we conduct name replacement. However, some critical elements are also anonymized by the government, due to the privacy issue. These elements are sometimes important to judgment prediction. For example, determination of the key factor long-time separation is relevant to the explicit dates, which are anonymized.\nConclusion\nIn this paper, we explore the task of predicting judgments of civil cases. Comparing with conventional text classification framework, we propose Legal Reading Comprehension framework to handle multiple and complex textual inputs. Moreover, we present a novel neural model, AutoJudge, to incorporate law articles for judgment prediction. In experiments, we compare our model on divorce proceedings with various state-of-the-art baselines of various frameworks. Experimental results show that our model achieves considerable improvement than all the baselines. Besides, visualization results also demonstrate the effectiveness and interpretability of our proposed model.\nIn the future, we can explore the following directions: (1) Limited by the datasets, we can only verify our proposed model on divorce proceedings. A more general and larger dataset will benefit the research on judgment prediction. (2) Judicial decisions in some civil cases are not always binary, but more diverse and flexible ones, e.g. compensation amount. Thus, it is critical for judgment prediction to manage various judgment forms.", "source": "scientific_qa", "evaluation": "f1"}
{"instructions": ["What domains are covered in the corpus?", "What is the architecture of their model?", "How was the dataset collected?", "Which languages are part of the corpus?", "How is the quality of the data empirically evaluated? ", "Is the data in CoVoST annotated for dialect?", "Is Arabic one of the 11 languages in CoVost?"], "outputs": ["No specific domain is covered in the corpus.", "follow the architecture in berard2018end, but have 3 decoder layers like that in pino2019harnessing", "Contributors record voice clips by reading from a bank of donated sentences.", "French, German, Dutch, Russian, Spanish, Italian, Turkish, Persian, Swedish, Mongolian and Chinese", "Validated transcripts were sent to professional translators., various sanity checks to the translations,  sanity check the overlaps of train, development and test sets", "No", "No"], "input": "Introduction\nEnd-to-end speech-to-text translation (ST) has attracted much attention recently BIBREF2, BIBREF3, BIBREF4, BIBREF5, BIBREF6 given its simplicity against cascading automatic speech recognition (ASR) and machine translation (MT) systems. The lack of labeled data, however, has become a major blocker for bridging the performance gaps between end-to-end models and cascading systems. Several corpora have been developed in recent years. post2013improved introduced a 38-hour Spanish-English ST corpus by augmenting the transcripts of the Fisher and Callhome corpora with English translations. di-gangi-etal-2019-must created the largest ST corpus to date from TED talks but the language pairs involved are out of English only. beilharz2019librivoxdeen created a 110-hour German-English ST corpus from LibriVox audiobooks. godard-etal-2018-low created a Moboshi-French ST corpus as part of a rare language documentation effort. woldeyohannis provided an Amharic-English ST corpus in the tourism domain. boito2019mass created a multilingual ST corpus involving 8 languages from a multilingual speech corpus based on Bible readings BIBREF7. Previous work either involves language pairs out of English, very specific domains, very low resource languages or a limited set of language pairs. This limits the scope of study, including the latest explorations on end-to-end multilingual ST BIBREF8, BIBREF9. Our work is mostly similar and concurrent to iranzosnchez2019europarlst who created a multilingual ST corpus from the European Parliament proceedings. The corpus we introduce has larger speech durations and more translation tokens. It is diversified with multiple speakers per transcript/translation. Finally, we provide additional out-of-domain test sets.\nIn this paper, we introduce CoVoST, a multilingual ST corpus based on Common Voice BIBREF10 for 11 languages into English, diversified with over 11,000 speakers and over 60 accents. It includes a total 708 hours of French (Fr), German (De), Dutch (Nl), Russian (Ru), Spanish (Es), Italian (It), Turkish (Tr), Persian (Fa), Swedish (Sv), Mongolian (Mn) and Chinese (Zh) speeches, with French and German ones having the largest durations among existing public corpora. We also collect an additional evaluation corpus from Tatoeba for French, German, Dutch, Russian and Spanish, resulting in a total of 9.3 hours of speech. Both corpora are created at the sentence level and do not require additional alignments or segmentation. Using the official Common Voice train-development-test split, we also provide baseline models, including, to our knowledge, the first end-to-end many-to-one multilingual ST models. CoVoST is released under CC0 license and free to use. The Tatoeba evaluation samples are also available under friendly CC licenses. All the data can be acquired at https://github.com/facebookresearch/covost.\nData Collection and Processing ::: Common Voice (CoVo)\nCommon Voice BIBREF10 is a crowdsourcing speech recognition corpus with an open CC0 license. Contributors record voice clips by reading from a bank of donated sentences. Each voice clip was validated by at least two other users. Most of the sentences are covered by multiple speakers, with potentially different genders, age groups or accents.\nRaw CoVo data contains samples that passed validation as well as those that did not. To build CoVoST, we only use the former one and reuse the official train-development-test partition of the validated data. As of January 2020, the latest CoVo 2019-06-12 release includes 29 languages. CoVoST is currently built on that release and covers the following 11 languages: French, German, Dutch, Russian, Spanish, Italian, Turkish, Persian, Swedish, Mongolian and Chinese.\nValidated transcripts were sent to professional translators. Note that the translators had access to the transcripts but not the corresponding voice clips since clips would not carry additional information. Since transcripts were duplicated due to multiple speakers, we deduplicated the transcripts before sending them to translators. As a result, different voice clips of the same content (transcript) will have identical translations in CoVoST for train, development and test splits.\nIn order to control the quality of the professional translations, we applied various sanity checks to the translations BIBREF11. 1) For German-English, French-English and Russian-English translations, we computed sentence-level BLEU BIBREF12 with the NLTK BIBREF13 implementation between the human translations and the automatic translations produced by a state-of-the-art system BIBREF14 (the French-English system was a Transformer big BIBREF15 separately trained on WMT14). We applied this method to these three language pairs only as we are confident about the quality of the corresponding systems. Translations with a score that was too low were manually inspected and sent back to the translators when needed. 2) We manually inspected examples where the source transcript was identical to the translation. 3) We measured the perplexity of the translations using a language model trained on a large amount of clean monolingual data BIBREF14. We manually inspected examples where the translation had a high perplexity and sent them back to translators accordingly. 4) We computed the ratio of English characters in the translations. We manually inspected examples with a low ratio and sent them back to translators accordingly. 5) Finally, we used VizSeq BIBREF16 to calculate similarity scores between transcripts and translations based on LASER cross-lingual sentence embeddings BIBREF17. Samples with low scores were manually inspected and sent back for translation when needed.\nWe also sanity check the overlaps of train, development and test sets in terms of transcripts and voice clips (via MD5 file hashing), and confirm they are totally disjoint.\nData Collection and Processing ::: Tatoeba (TT)\nTatoeba (TT) is a community built language learning corpus having sentences aligned across multiple languages with the corresponding speech partially available. Its sentences are on average shorter than those in CoVoST (see also Table TABREF2) given the original purpose of language learning. Sentences in TT are licensed under CC BY 2.0 FR and part of the speeches are available under various CC licenses.\nWe construct an evaluation set from TT (for French, German, Dutch, Russian and Spanish) as a complement to CoVoST development and test sets. We collect (speech, transcript, English translation) triplets for the 5 languages and do not include those whose speech has a broken URL or is not CC licensed. We further filter these samples by sentence lengths (minimum 4 words including punctuations) to reduce the portion of short sentences. This makes the resulting evaluation set closer to real-world scenarios and more challenging.\nWe run the same quality checks for TT as for CoVoST but we do not find poor quality translations according to our criteria. Finally, we report the overlap between CoVo transcripts and TT sentences in Table TABREF5. We found a minimal overlap, which makes the TT evaluation set a suitable additional test set when training on CoVoST.\nData Analysis ::: Basic Statistics\nBasic statistics for CoVoST and TT are listed in Table TABREF2 including (unique) sentence counts, speech durations, speaker demographics (partially available) as well as vocabulary and token statistics (based on Moses-tokenized sentences by sacreMoses) on both transcripts and translations. We see that CoVoST has over 327 hours of German speeches and over 171 hours of French speeches, which, to our knowledge, corresponds to the largest corpus among existing public ST corpora (the second largest is 110 hours BIBREF18 for German and 38 hours BIBREF19 for French). Moreover, CoVoST has a total of 18 hours of Dutch speeches, to our knowledge, contributing the first public Dutch ST resource. CoVoST also has around 27-hour Russian speeches, 37-hour Italian speeches and 67-hour Persian speeches, which is 1.8 times, 2.5 times and 13.3 times of the previous largest public one BIBREF7. Most of the sentences (transcripts) in CoVoST are covered by multiple speakers with potentially different accents, resulting in a rich diversity in the speeches. For example, there are over 1,000 speakers and over 10 accents in the French and German development / test sets. This enables good coverage of speech variations in both model training and evaluation.\nData Analysis ::: Speaker Diversity\nAs we can see from Table TABREF2, CoVoST is diversified with a rich set of speakers and accents. We further inspect the speaker demographics in terms of sample distributions with respect to speaker counts, accent counts and age groups, which is shown in Figure FIGREF6, FIGREF7 and FIGREF8. We observe that for 8 of the 11 languages, at least 60% of the sentences (transcripts) are covered by multiple speakers. Over 80% of the French sentences have at least 3 speakers. And for German sentences, even over 90% of them have at least 5 speakers. Similarly, we see that a large portion of sentences are spoken in multiple accents for French, German, Dutch and Spanish. Speakers of each language also spread widely across different age groups (below 20, 20s, 30s, 40s, 50s, 60s and 70s).\nBaseline Results\nWe provide baselines using the official train-development-test split on the following tasks: automatic speech recognition (ASR), machine translation (MT) and speech translation (ST).\nBaseline Results ::: Experimental Settings ::: Data Preprocessing\nWe convert raw MP3 audio files from CoVo and TT into mono-channel waveforms, and downsample them to 16,000 Hz. For transcripts and translations, we normalize the punctuation, we tokenize the text with sacreMoses and lowercase it. For transcripts, we further remove all punctuation markers except for apostrophes. We use character vocabularies on all the tasks, with 100% coverage of all the characters. Preliminary experimentation showed that character vocabularies provided more stable training than BPE. For MT, the vocabulary is created jointly on both transcripts and translations. We extract 80-channel log-mel filterbank features, computed with a 25ms window size and 10ms window shift using torchaudio. The features are normalized to 0 mean and 1.0 standard deviation. We remove samples having more than 3,000 frames or more than 256 characters for GPU memory efficiency (less than 25 samples are removed for all languages).\nBaseline Results ::: Experimental Settings ::: Model Training\nOur ASR and ST models follow the architecture in berard2018end, but have 3 decoder layers like that in pino2019harnessing. For MT, we use a Transformer base architecture BIBREF15, but with 3 encoder layers, 3 decoder layers and 0.3 dropout. We use a batch size of 10,000 frames for ASR and ST, and a batch size of 4,000 tokens for MT. We train all models using Fairseq BIBREF20 for up to 200,000 updates. We use SpecAugment BIBREF21 for ASR and ST to alleviate overfitting.\nBaseline Results ::: Experimental Settings ::: Inference and Evaluation\nWe use a beam size of 5 for all models. We use the best checkpoint by validation loss for MT, and average the last 5 checkpoints for ASR and ST. For MT and ST, we report case-insensitive tokenized BLEU BIBREF22 using sacreBLEU BIBREF23. For ASR, we report word error rate (WER) and character error rate (CER) using VizSeq.\nBaseline Results ::: Automatic Speech Recognition (ASR)\nFor simplicity, we use the same model architecture for ASR and ST, although we do not leverage ASR models to pretrain ST model encoders later. Table TABREF18 shows the word error rate (WER) and character error rate (CER) for ASR models. We see that French and German perform the best given they are the two highest resource languages in CoVoST. The other languages are relatively low resource (especially Turkish and Swedish) and the ASR models are having difficulties to learn from this data.\nBaseline Results ::: Machine Translation (MT)\nMT models take transcripts (without punctuation) as inputs and outputs translations (with punctuation). For simplicity, we do not change the text preprocessing methods for MT to correct this mismatch. Moreover, this mismatch also exists in cascading ST systems, where MT model inputs are the outputs of an ASR model. Table TABREF20 shows the BLEU scores of MT models. We notice that the results are consistent with what we see from ASR models. For example thanks to abundant training data, French has a decent BLEU score of 29.8/25.4. German doesn't perform well, because of less richness of content (transcripts). The other languages are low resource in CoVoST and it is difficult to train decent models without additional data or pre-training techniques.\nBaseline Results ::: Speech Translation (ST)\nCoVoST is a many-to-one multilingual ST corpus. While end-to-end one-to-many and many-to-many multilingual ST models have been explored very recently BIBREF8, BIBREF9, many-to-one multilingual models, to our knowledge, have not. We hence use CoVoST to examine this setting. Table TABREF22 and TABREF23 show the BLEU scores for both bilingual and multilingual end-to-end ST models trained on CoVoST. We observe that combining speeches from multiple languages is consistently bringing gains to low-resource languages (all besides French and German). This includes combinations of distant languages, such as Ru+Fr, Tr+Fr and Zh+Fr. Moreover, some combinations do bring gains to high-resource language (French) as well: Es+Fr, Tr+Fr and Mn+Fr. We simply provide the most basic many-to-one multilingual baselines here, and leave the full exploration of the best configurations to future work. Finally, we note that for some language pairs, absolute BLEU numbers are relatively low as we restrict model training to the supervised data. We encourage the community to improve upon those baselines, for example by leveraging semi-supervised training.\nBaseline Results ::: Multi-Speaker Evaluation\nIn CoVoST, large portion of transcripts are covered by multiple speakers with different genders, accents and age groups. Besides the standard corpus-level BLEU scores, we also want to evaluate model output variance on the same content (transcript) but different speakers. We hence propose to group samples (and their sentence BLEU scores) by transcript, and then calculate average per-group mean and average coefficient of variation defined as follows:\nand\nwhere $G$ is the set of sentence BLEU scores grouped by transcript and $G^{\\prime } = \\lbrace g | g\\in G, |g|>1, \\textrm {Mean}(g) > 0 \\rbrace $.\n$\\textrm {BLEU}_{MS}$ provides a normalized quality score as oppose to corpus-level BLEU or unnormalized average of sentence BLEU. And $\\textrm {CoefVar}_{MS}$ is a standardized measure of model stability against different speakers (the lower the better). Table TABREF24 shows the $\\textrm {BLEU}_{MS}$ and $\\textrm {CoefVar}_{MS}$ of our ST models on CoVoST test set. We see that German and Persian have the worst $\\textrm {CoefVar}_{MS}$ (least stable) given their rich speaker diversity in the test set and relatively small train set (see also Figure FIGREF6 and Table TABREF2). Dutch also has poor $\\textrm {CoefVar}_{MS}$ because of the lack of training data. Multilingual models are consistantly more stable on low-resource languages. Ru+Fr, Tr+Fr, Fa+Fr and Zh+Fr even have better $\\textrm {CoefVar}_{MS}$ than all individual languages.\nConclusion\nWe introduce a multilingual speech-to-text translation corpus, CoVoST, for 11 languages into English, diversified with over 11,000 speakers and over 60 accents. We also provide baseline results, including, to our knowledge, the first end-to-end many-to-one multilingual model for spoken language translation. CoVoST is free to use with a CC0 license, and the additional Tatoeba evaluation samples are also CC-licensed.", "source": "scientific_qa", "evaluation": "f1"}
{"instructions": ["What kind of model do they use?", "What kind of model do they use?", "Did they release their data set of academic papers?", "Do the methods that work best on academic papers also work best on Wikipedia?", "What is their system's absolute accuracy?", "Which is more useful, visual or textual features?", "Which languages do they use?", "How large is their data set?", "Where do they get their ground truth quality judgments?"], "outputs": ["visual model is based on fine-tuning an Inception V3 model BIBREF1 over visual renderings of documents, while our textual model is based on a hierarchical biLSTM. We further combine the two into a joint model. , neural network models", "Inception V3, biLSTM", "No", "No", "59.4% on wikipedia dataset, 93.4% on peer-reviewed archive AI papers, 77.1%  on peer-reviewed archive Computation and Language papers, and 79.9% on peer-reviewed archive Machine Learning papers", "It depends on the dataset. Experimental results over two datasets reveal that textual and visual features are complementary. ", "English", "a sample of  29,794 wikipedia articles and 2,794 arXiv papers ", "quality class labels assigned by the Wikipedia community, a paper is considered to have been accepted (i.e. is positively labeled) if it matches a paper in the DBLP database or is otherwise accepted by any of the following conferences: ACL, EMNLP, NAACL, EACL, TACL, NIPS, ICML, ICLR, or AAAI"], "input": "Introduction\nThe task of document quality assessment is to automatically assess a document according to some predefined inventory of quality labels. This can take many forms, including essay scoring (quality = language quality, coherence, and relevance to a topic), job application filtering (quality = suitability for role + visual/presentational quality of the application), or answer selection in community question answering (quality = actionability + relevance of the answer to the question). In the case of this paper, we focus on document quality assessment in two contexts: Wikipedia document quality classification, and whether a paper submitted to a conference was accepted or not.\nAutomatic quality assessment has obvious benefits in terms of time savings and tractability in contexts where the volume of documents is large. In the case of dynamic documents (possibly with multiple authors), such as in the case of Wikipedia, it is particularly pertinent, as any edit potentially has implications for the quality label of that document (and around 10 English Wikipedia documents are edited per second). Furthermore, when the quality assessment task is decentralized (as in the case of Wikipedia and academic paper assessment), quality criteria are often applied inconsistently by different people, where an automatic document quality assessment system could potentially reduce inconsistencies and enable immediate author feedback.\nCurrent studies on document quality assessment mainly focus on textual features. For example, BIBREF0 examine features such as the article length and the number of headings to predict the quality class of a Wikipedia article. In contrast to these studies, in this paper, we propose to combine text features with visual features, based on a visual rendering of the document. Figure 1 illustrates our intuition, relative to Wikipedia articles. Without being able to read the text, we can tell that the article in Figure 1 has higher quality than Figure 1 , as it has a detailed infobox, extensive references, and a variety of images. Based on this intuition, we aim to answer the following question: can we achieve better accuracy on document quality assessment by complementing textual features with visual features?\nOur visual model is based on fine-tuning an Inception V3 model BIBREF1 over visual renderings of documents, while our textual model is based on a hierarchical biLSTM. We further combine the two into a joint model. We perform experiments on two datasets: a Wikipedia dataset novel to this paper, and an arXiv dataset provided by BIBREF2 split into three sub-parts based on subject category. Experimental results on the visual renderings of documents show that implicit quality indicators, such as images and visual layout, can be captured by an image classifier, at a level comparable to a text classifier. When we combine the two models, we achieve state-of-the-art results over 3/4 of our datasets.\nThis paper makes the following contributions:\nAll code and data associated with this research will be released on publication.\nRelated Work\nA variety of approaches have been proposed for document quality assessment across different domains: Wikipedia article quality assessment, academic paper rating, content quality assessment in community question answering (cQA), and essay scoring. Among these approaches, some use hand-crafted features while others use neural networks to learn features from documents. For each domain, we first briefly describe feature-based approaches and then review neural network-based approaches. Wikipedia article quality assessment: Quality assessment of Wikipedia articles is a task that assigns a quality class label to a given Wikipedia article, mirroring the quality assessment process that the Wikipedia community carries out manually. Many approaches have been proposed that use features from the article itself, meta-data features (e.g., the editors, and Wikipedia article revision history), or a combination of the two. Article-internal features capture information such as whether an article is properly organized, with supporting evidence, and with appropriate terminology. For example, BIBREF3 use writing styles represented by binarized character trigram features to identify featured articles. BIBREF4 and BIBREF0 explore the number of headings, images, and references in the article. BIBREF5 use nine readability scores, such as the percentage of difficult words in the document, to measure the quality of the article. Meta-data features, which are indirect indicators of article quality, are usually extracted from revision history, and the interaction between editors and articles. For example, one heuristic that has been proposed is that higher-quality articles have more edits BIBREF6 , BIBREF7 . BIBREF8 use the percentage of registered editors and the total number of editors of an article. Article\u2013editor dependencies have also been explored. For example, BIBREF9 use the authority of editors to measure the quality of Wikipedia articles, where the authority of editors is determined by the articles they edit. Deep learning approaches to predicting Wikipedia article quality have also been proposed. For example, BIBREF10 use a version of doc2vec BIBREF11 to represent articles, and feed the document embeddings into a four hidden layer neural network. BIBREF12 first obtain sentence representations by averaging words within a sentence, and then apply a biLSTM BIBREF13 to learn a document-level representation, which is combined with hand-crafted features as side information. BIBREF14 exploit two stacked biLSTMs to learn document representations.\nAcademic paper rating: Academic paper rating is a relatively new task in NLP/AI, with the basic formulation being to automatically predict whether to accept or reject a paper. BIBREF2 explore hand-crafted features, such as the length of the title, whether specific words (such as outperform, state-of-the-art, and novel) appear in the abstract, and an embedded representation of the abstract as input to different downstream learners, such as logistic regression, decision tree, and random forest. BIBREF15 exploit a modularized hierarchical convolutional neural network (CNN), where each paper section is treated as a module. For each paper section, they train an attention-based CNN, and an attentive pooling layer is applied to the concatenated representation of each section, which is then fed into a softmax layer.\nContent quality assessment in cQA: Automatic quality assessment in cQA is the task of determining whether an answer is of high quality, selected as the best answer, or ranked higher than other answers. To measure answer content quality in cQA, researchers have exploited various features from different sources, such as the answer content itself, the answerer's profile, interactions among users, and usage of the content. The most common feature used is the answer length BIBREF16 , BIBREF17 , with other features including: syntactic and semantic features, such as readability scores. BIBREF18 ; similarity between the question and the answer at lexical, syntactic, and semantic levels BIBREF18 , BIBREF19 , BIBREF20 ; or user data (e.g., a user's status points or the number of answers written by the user). There have also been approaches using neural networks. For example, BIBREF21 combine CNN-learned representations with hand-crafted features to predict answer quality. BIBREF22 use a 2-dimensional CNN to learn the semantic relevance of an answer to the question, and apply an LSTM to the answer sequence to model thread context. BIBREF23 and BIBREF24 model the problem similarly to machine translation quality estimation, treating answers as competing translation hypotheses and the question as the reference translation, and apply neural machine translation to the problem. Essay scoring: Automated essay scoring is the task of assigning a score to an essay, usually in the context of assessing the language ability of a language learner. The quality of an essay is affected by the following four primary dimensions: topic relevance, organization and coherence, word usage and sentence complexity, and grammar and mechanics. To measure whether an essay is relevant to its \u201cprompt\u201d (the description of the essay topic), lexical and semantic overlap is commonly used BIBREF25 , BIBREF26 . BIBREF27 explore word features, such as the number of verb formation errors, average word frequency, and average word length, to measure word usage and lexical complexity. BIBREF28 use sentence structure features to measure sentence variety. The effects of grammatical and mechanic errors on the quality of an essay are measured via word and part-of-speech $n$ -gram features and \u201cmechanics\u201d features BIBREF29 (e.g., spelling, capitalization, and punctuation), respectively. BIBREF30 , BIBREF31 , and BIBREF32 use an LSTM to obtain an essay representation, which is used as the basis for classification. Similarly, BIBREF33 utilize a CNN to obtain sentence representation and an LSTM to obtain essay representation, with an attention layer at both the sentence and essay levels.\nThe Proposed Joint Model\nWe treat document quality assessment as a classification problem, i.e., given a document, we predict its quality class (e.g., whether an academic paper should be accepted or rejected). The proposed model is a joint model that integrates visual features learned through Inception V3 with textual features learned through a biLSTM. In this section, we present the details of the visual and textual embeddings, and finally describe how we combine the two. We return to discuss hyper-parameter settings and the experimental configuration in the Experiments section.\nVisual Embedding Learning\nA wide range of models have been proposed to tackle the image classification task, such as VGG BIBREF34 , ResNet BIBREF35 , Inception V3 BIBREF1 , and Xception BIBREF36 . However, to the best of our knowledge, there is no existing work that has proposed to use visual renderings of documents to assess document quality. In this paper, we use Inception V3 pretrained on ImageNet (\u201cInception\u201d hereafter) to obtain visual embeddings of documents, noting that any image classifier could be applied to our task. The input to Inception is a visual rendering (screenshot) of a document, and the output is a visual embedding, which we will later integrate with our textual embedding.\nBased on the observation that it is difficult to decide what types of convolution to apply to each layer (such as 3 $\\times $ 3 or 5 $\\times $ 5), the basic Inception model applies multiple convolution filters in parallel and concatenates the resulting features, which are fed into the next layer. This has the benefit of capturing both local features through smaller convolutions and abstracted features through larger convolutions. Inception is a hybrid of multiple Inception models of different architectures. To reduce computational cost, Inception also modifies the basic model by applying a 1 $\\times $ 1 convolution to the input and factorizing larger convolutions into smaller ones.\nTextual Embedding Learning\nWe adopt a bi-directional LSTM model to generate textual embeddings for document quality assessment, following the method of BIBREF12 (\u201cbiLSTM\u201d hereafter). The input to biLSTM is a textual document, and the output is a textual embedding, which will later integrate with the visual embedding.\nFor biLSTM, each word is represented as a word embedding BIBREF37 , and an average-pooling layer is applied to the word embeddings to obtain the sentence embedding, which is fed into a bi-directional LSTM to generate the document embedding from the sentence embeddings. Then a max-pooling layer is applied to select the most salient features from the component sentences.\nThe Joint Model\nThe proposed joint model (\u201cJoint\u201d hereafter) combines the visual and textual embeddings (output of Inception and biLSTM) via a simple feed-forward layer and softmax over the document label set, as shown in Figure 2 . We optimize our model based on cross-entropy loss.\nExperiments\nIn this section, we first describe the two datasets used in our experiments: (1) Wikipedia, and (2) arXiv. Then, we report the experimental details and results.\nDatasets\nThe Wikipedia dataset consists of articles from English Wikipedia, with quality class labels assigned by the Wikipedia community. Wikipedia articles are labelled with one of six quality classes, in descending order of quality: Featured Article (\u201cFA\u201d), Good Article (\u201cGA\u201d), B-class Article (\u201cB\u201d), C-class Article (\u201cC\u201d), Start Article (\u201cStart\u201d), and Stub Article (\u201cStub\u201d). A description of the criteria associated with the different classes can be found in the Wikipedia grading scheme page. The quality class of a Wikipedia article is assigned by Wikipedia reviewers or any registered user, who can discuss through the article's talk page to reach consensus. We constructed the dataset by first crawling all articles from each quality class repository, e.g., we get FA articles by crawling pages from the FA repository: https://en.wikipedia.org/wiki/Category:Featured_articles. This resulted in around 5K FA, 28K GA, 212K B, 533K C, 2.6M Start, and 3.2M Stub articles.\nWe randomly sampled 5,000 articles from each quality class and removed all redirect pages, resulting in a dataset of 29,794 articles. As the wikitext contained in each document contains markup relating to the document category such as {Featured Article} or {geo-stub}, which reveals the label, we remove such information. We additionally randomly partitioned this dataset into training, development, and test splits based on a ratio of 8:1:1. Details of the dataset are summarized in Table 1 .\nWe generate a visual representation of each document via a 1,000 $\\times $ 2,000-pixel screenshot of the article via a PhantomJS script over the rendered version of the article, ensuring that the screenshot and wikitext versions of the article are the same version. Any direct indicators of document quality (such as the FA indicator, which is a bronze star icon in the top right corner of the webpage) are removed from the screenshot.\nThe arXiv dataset BIBREF2 consists of three subsets of academic articles under the arXiv repository of Computer Science (cs), from the three subject areas of: Artificial Intelligence (cs.ai), Computation and Language (cs.cl), and Machine Learning (cs.lg). In line with the original dataset formulation BIBREF2 , a paper is considered to have been accepted (i.e. is positively labeled) if it matches a paper in the DBLP database or is otherwise accepted by any of the following conferences: ACL, EMNLP, NAACL, EACL, TACL, NIPS, ICML, ICLR, or AAAI. Failing this, it is considered to be rejected (noting that some of the papers may not have been submitted to one of these conferences). The median numbers of pages for papers in cs.ai, cs.cl, and cs.lg are 11, 10, and 12, respectively. To make sure each page in the PDF file has the same size in the screenshot, we crop the PDF file of a paper to the first 12; we pad the PDF file with blank pages if a PDF file has less than 12 pages, using the PyPDF2 Python package. We then use ImageMagick to convert the 12-page PDF file to a single 1,000 $\\times $ 2,000 pixel screenshot. Table 2 details this dataset, where the \u201cAccepted\u201d column denotes the percentage of positive instances (accepted papers) in each subset.\nExperimental Setting\nAs discussed above, our model has two main components \u2014 biLSTM and Inception\u2014 which generate textual and visual representations, respectively. For the biLSTM component, the documents are preprocessed as described in BIBREF12 , where an article is divided into sentences and tokenized using NLTK BIBREF38 . Words appearing more than 20 times are retained when building the vocabulary. All other words are replaced by the special UNK token. We use the pre-trained GloVe BIBREF39 50-dimensional word embeddings to represent words. For words not in GloVe, word embeddings are randomly initialized based on sampling from a uniform distribution $U(-1, 1)$ . All word embeddings are updated in the training process. We set the LSTM hidden layer size to 256. The concatenation of the forward and backward LSTMs thus gives us 512 dimensions for the document embedding. A dropout layer is applied at the sentence and document level, respectively, with a probability of 0.5.\nFor Inception, we adopt data augmentation techniques in the training with a \u201cnearest\u201d filling mode, a zoom range of 0.1, a width shift range of 0.1, and a height shift range of 0.1. As the original screenshots have the size of 1,000 $\\times 2$ ,000 pixels, they are resized to 500 $\\times $ 500 to feed into Inception, where the input shape is (500, 500, 3). A dropout layer is applied with a probability of 0.5. Then, a GlobalAveragePooling2D layer is applied, which produces a 2,048 dimensional representation.\nFor the Joint model, we get a representation of 2,560 dimensions by concatenating the 512 dimensional representation from the biLSTM with the 2,048 dimensional representation from Inception. The dropout layer is applied to the two components with a probability of 0.5. For biLSTM, we use a mini-batch size of 128 and a learning rate of 0.001. For both Inception and joint model, we use a mini-batch size of 16 and a learning rate of 0.0001. All hyper-parameters were set empirically over the development data, and the models were optimized using the Adam optimizer BIBREF40 .\nIn the training phase, the weights in Inception are initialized by parameters pretrained on ImageNet, and the weights in biLSTM are randomly initialized (except for the word embeddings). We train each model for 50 epochs. However, to prevent overfitting, we adopt early stopping, where we stop training the model if the performance on the development set does not improve for 20 epochs. For evaluation, we use (micro-)accuracy, following previous studies BIBREF5 , BIBREF2 .\nBaseline Approaches\nWe compare our models against the following five baselines:\nMajority: the model labels all test samples with the majority class of the training data.\nBenchmark: a benchmark method from the literature. In the case of Wikipedia, this is BIBREF5 , who use structural features and readability scores as features to build a random forest classifier; for arXiv, this is BIBREF2 , who use hand-crafted features, such as the number of references and TF-IDF weighted bag-of-words in abstract, to build a classifier based on the best of logistic regression, multi-layer perception, and AdaBoost.\nDoc2Vec: doc2vec BIBREF11 to learn document embeddings with a dimension of 500, and a 4-layer feed-forward classification model on top of this, with 2000, 1000, 500, and 200 dimensions, respectively.\nbiLSTM: first derive a sentence representation by averaging across words in a sentence, then feed the sentence representation into a biLSTM and a maxpooling layer over output sequence to learn a document level representation with a dimension of 512, which is used to predict document quality.\nInception $_{\\text{fixed}}$ : the frozen Inception model, where only parameters in the last layer are fine-tuned during training.\nThe hyper-parameters of Benchmark, Doc2Vec, and biLSTM are based on the corresponding papers except that: (1) we fine-tune the feed forward layer of Doc2Vec on the development set and train the model 300 epochs on Wikipedia and 50 epochs on arXiv; (2) we do not use hand-crafted features for biLSTM as we want the baselines to be comparable to our models, and the main focus of this paper is not to explore the effects of hand-crafted features (e.g., see BIBREF12 ).\nExperimental Results\nTable 3 shows the performance of the different models over our two datasets, in the form of the average accuracy on the test set (along with the standard deviation) over 10 runs, with different random initializations.\nOn Wikipedia, we observe that the performance of biLSTM, Inception, and Joint is much better than that of all four baselines. Inception achieves 2.9% higher accuracy than biLSTM. The performance of Joint achieves an accuracy of 59.4%, which is 5.3% higher than using textual features alone (biLSTM) and 2.4% higher than using visual features alone (Inception). Based on a one-tailed Wilcoxon signed-rank test, the performance of Joint is statistically significant ( $p<0.05$ ). This shows that the textual and visual features complement each other, achieving state-of-the-art results in combination.\nFor arXiv, baseline methods Majority, Benchmark, and Inception $_{\\text{fixed}}$ outperform biLSTM over cs.ai, in large part because of the class imbalance in this dataset (90% of papers are rejected). Surprisingly, Inception $_{\\text{fixed}}$ is better than Majority and Benchmark over the arXiv cs.lg subset, which verifies the usefulness of visual features, even when only the last layer is fine-tuned. Table 3 also shows that Inception and biLSTM achieve similar performance on arXiv, showing that textual and visual representations are equally discriminative: Inception and biLSTM are indistinguishable over cs.cl; biLSTM achieves 1.8% higher accuracy over cs.lg, while Inception achieves 1.3% higher accuracy over cs.ai. Once again, the Joint model achieves the highest accuracy on cs.ai and cs.cl by combining textual and visual representations (at a level of statistical significance for cs.ai). This, again, confirms that textual and visual features complement each other, and together they achieve state-of-the-art results. On arXiv cs.lg, Joint achieves a 0.6% higher accuracy than Inception by combining visual features and textual features, but biLSTM achieves the highest accuracy. One characteristic of cs.lg documents is that they tend to contain more equations than the other two arXiv datasets, and preliminary analysis suggests that the biLSTM is picking up on a correlation between the volume/style of mathematical presentation and the quality of the document.\nAnalysis\nIn this section, we first analyze the performance of Inception and Joint. We also analyze the performance of different models on different quality classes. The high-level representations learned by different models are also visualized and discussed. As the Wikipedia test set is larger and more balanced than that of arXiv, our analysis will focus on Wikipedia.\nInception\nTo better understand the performance of Inception, we generated the gradient-based class activation map BIBREF41 , by maximizing the outputs of each class in the penultimate layer, as shown in Figure 3 . From Figure 3 and Figure 3 , we can see that Inception identifies the two most important regions (one at the top corresponding to the table of contents, and the other at the bottom, capturing both document length and references) that contribute to the FA class prediction, and a region in the upper half of the image that contributes to the GA class prediction (capturing the length of the article body). From Figure 3 and Figure 3 , we can see that the most important regions in terms of B and C class prediction capture images (down the left and right of the page, in the case of B and C), and document length/references. From Figure 3 and Figure 3 , we can see that Inception finds that images in the top right corner are the strongest predictor of Start class prediction, and (the lack of) images/the link bar down the left side of the document are the most important for Stub class prediction.\nJoint\nTable 4 shows the confusion matrix of Joint on Wikipedia. We can see that more than 50% of documents for each quality class are correctly classified, except for the C class where more documents are misclassified into B. Analysis shows that when misclassified, documents are usually misclassified into adjacent quality classes, which can be explained by the Wikipedia grading scheme, where the criteria for adjacent quality classes are more similar.\nWe also provide a breakdown of precision (\u201c $\\mathcal {P}$ \u201d), recall (\u201c $\\mathcal {R}$ \u201d), and F1 score (\u201c $\\mathcal {F}_{\\beta =1}$ \u201d) for biLSTM, Inception, and Joint across the quality classes in Table 5 . We can see that Joint achieves the highest accuracy in 11 out of 18 cases. It is also worth noting that all models achieve higher scores for FA, GA, and Stub articles than B, C and Start articles. This can be explained in part by the fact that FA and GA articles must pass an official review based on structured criteria, and in part by the fact that Stub articles are usually very short, which is discriminative for Inception, and Joint. All models perform worst on the B and C quality classes. It is difficult to differentiate B articles from C articles even for Wikipedia contributors. As evidence of this, when we crawled a new dataset including talk pages with quality class votes from Wikipedia contributors, we found that among articles with three or more quality labels, over 20% percent of B and C articles have inconsistent votes from Wikipedia contributors, whereas for FA and GA articles the number is only 0.7%.\nWe further visualize the learned document representations of biLSTM, Inception, and Joint in the form of a t-SNE plot BIBREF42 in Figure 4 . The degree of separation between Start and Stub achieved by Inception is much greater than for biLSTM, with the separation between Start and Stub achieved by Joint being the clearest among the three models. Inception and Joint are better than biLSTM at separating Start and C. Joint achieves slightly better performance than Inception in separating GA and FA. We can also see that it is difficult for all models to separate B and C, which is consistent with the findings of Tables 4 and 5 .\nConclusions\nWe proposed to use visual renderings of documents to capture implicit document quality indicators, such as font choices, images, and visual layout, which are not captured in textual content. We applied neural network models to capture visual features given visual renderings of documents. Experimental results show that we achieve a 2.9% higher accuracy than state-of-the-art approaches based on textual features over Wikipedia, and performance competitive with or surpassing state-of-the-art approaches over arXiv. We further proposed a joint model, combining textual and visual representations, to predict the quality of a document. Experimental results show that our joint model outperforms the visual-only model in all cases, and the text-only model on Wikipedia and two subsets of arXiv. These results underline the feasibility of assessing document quality via visual features, and the complementarity of visual and textual document representations for quality assessment.", "source": "scientific_qa", "evaluation": "f1"}
{"instructions": ["What is the seed lexicon?", "What is the seed lexicon?", "What are the results?", "How are relations used to propagate polarity?", "How big is the Japanese data?", "What are labels available in dataset for supervision?", "How big are improvements of supervszed learning results trained on smalled labeled data enhanced with proposed approach copared to basic approach?", "How does their model learn using mostly raw data?", "How big is seed lexicon used for training?", "How large is raw corpus used for training?"], "outputs": ["a vocabulary of positive and negative predicates that helps determine the polarity score of an event", "seed lexicon consists of positive and negative predicates", "Using all data to train: AL -- BiGRU achieved 0.843 accuracy, AL -- BERT achieved 0.863 accuracy, AL+CA+CO -- BiGRU achieved 0.866 accuracy, AL+CA+CO -- BERT achieved 0.835, accuracy, ACP -- BiGRU achieved 0.919 accuracy, ACP -- BERT achived 0.933, accuracy, ACP+AL+CA+CO -- BiGRU achieved 0.917 accuracy, ACP+AL+CA+CO -- BERT achieved 0.913 accuracy. \nUsing a subset to train: BERT achieved 0.876 accuracy using ACP (6K), BERT achieved 0.886 accuracy using ACP (6K) + AL, BiGRU achieved 0.830 accuracy using ACP (6K), BiGRU achieved 0.879 accuracy using ACP (6K) + AL + CA + CO.", "cause relation: both events in the relation should have the same polarity; concession relation: events should have opposite polarity", "The ACP corpus has around 700k events split into positive and negative polarity ", "negative, positive", "3%", "by exploiting discourse relations to propagate polarity from seed predicates to final sentiment polarity", "30 words", "100 million sentences"], "input": "Introduction\nAffective events BIBREF0 are events that typically affect people in positive or negative ways. For example, getting money and playing sports are usually positive to the experiencers; catching cold and losing one's wallet are negative. Understanding affective events is important to various natural language processing (NLP) applications such as dialogue systems BIBREF1, question-answering systems BIBREF2, and humor recognition BIBREF3. In this paper, we work on recognizing the polarity of an affective event that is represented by a score ranging from $-1$ (negative) to 1 (positive).\nLearning affective events is challenging because, as the examples above suggest, the polarity of an event is not necessarily predictable from its constituent words. Combined with the unbounded combinatorial nature of language, the non-compositionality of affective polarity entails the need for large amounts of world knowledge, which can hardly be learned from small annotated data.\nIn this paper, we propose a simple and effective method for learning affective events that only requires a very small seed lexicon and a large raw corpus. As illustrated in Figure FIGREF1, our key idea is that we can exploit discourse relations BIBREF4 to efficiently propagate polarity from seed predicates that directly report one's emotions (e.g., \u201cto be glad\u201d is positive). Suppose that events $x_1$ are $x_2$ are in the discourse relation of Cause (i.e., $x_1$ causes $x_2$). If the seed lexicon suggests $x_2$ is positive, $x_1$ is also likely to be positive because it triggers the positive emotion. The fact that $x_2$ is known to be negative indicates the negative polarity of $x_1$. Similarly, if $x_1$ and $x_2$ are in the discourse relation of Concession (i.e., $x_2$ in spite of $x_1$), the reverse of $x_2$'s polarity can be propagated to $x_1$. Even if $x_2$'s polarity is not known in advance, we can exploit the tendency of $x_1$ and $x_2$ to be of the same polarity (for Cause) or of the reverse polarity (for Concession) although the heuristic is not exempt from counterexamples. We transform this idea into objective functions and train neural network models that predict the polarity of a given event.\nWe trained the models using a Japanese web corpus. Given the minimum amount of supervision, they performed well. In addition, the combination of annotated and unannotated data yielded a gain over a purely supervised baseline when labeled data were small.\nRelated Work\nLearning affective events is closely related to sentiment analysis. Whereas sentiment analysis usually focuses on the polarity of what are described (e.g., movies), we work on how people are typically affected by events. In sentiment analysis, much attention has been paid to compositionality. Word-level polarity BIBREF5, BIBREF6, BIBREF7 and the roles of negation and intensification BIBREF8, BIBREF6, BIBREF9 are among the most important topics. In contrast, we are more interested in recognizing the sentiment polarity of an event that pertains to commonsense knowledge (e.g., getting money and catching cold).\nLabel propagation from seed instances is a common approach to inducing sentiment polarities. While BIBREF5 and BIBREF10 worked on word- and phrase-level polarities, BIBREF0 dealt with event-level polarities. BIBREF5 and BIBREF10 linked instances using co-occurrence information and/or phrase-level coordinations (e.g., \u201c$A$ and $B$\u201d and \u201c$A$ but $B$\u201d). We shift our scope to event pairs that are more complex than phrase pairs, and consequently exploit discourse connectives as event-level counterparts of phrase-level conjunctions.\nBIBREF0 constructed a network of events using word embedding-derived similarities. Compared with this method, our discourse relation-based linking of events is much simpler and more intuitive.\nSome previous studies made use of document structure to understand the sentiment. BIBREF11 proposed a sentiment-specific pre-training strategy using unlabeled dialog data (tweet-reply pairs). BIBREF12 proposed a method of building a polarity-tagged corpus (ACP Corpus). They automatically gathered sentences that had positive or negative opinions utilizing HTML layout structures in addition to linguistic patterns. Our method depends only on raw texts and thus has wider applicability.\nProposed Method\nProposed Method ::: Polarity Function\nOur goal is to learn the polarity function $p(x)$, which predicts the sentiment polarity score of an event $x$. We approximate $p(x)$ by a neural network with the following form:\n${\\rm Encoder}$ outputs a vector representation of the event $x$. ${\\rm Linear}$ is a fully-connected layer and transforms the representation into a scalar. ${\\rm tanh}$ is the hyperbolic tangent and transforms the scalar into a score ranging from $-1$ to 1. In Section SECREF21, we consider two specific implementations of ${\\rm Encoder}$.\nProposed Method ::: Discourse Relation-Based Event Pairs\nOur method requires a very small seed lexicon and a large raw corpus. We assume that we can automatically extract discourse-tagged event pairs, $(x_{i1}, x_{i2})$ ($i=1, \\cdots $) from the raw corpus. We refer to $x_{i1}$ and $x_{i2}$ as former and latter events, respectively. As shown in Figure FIGREF1, we limit our scope to two discourse relations: Cause and Concession.\nThe seed lexicon consists of positive and negative predicates. If the predicate of an extracted event is in the seed lexicon and does not involve complex phenomena like negation, we assign the corresponding polarity score ($+1$ for positive events and $-1$ for negative events) to the event. We expect the model to automatically learn complex phenomena through label propagation. Based on the availability of scores and the types of discourse relations, we classify the extracted event pairs into the following three types.\nProposed Method ::: Discourse Relation-Based Event Pairs ::: AL (Automatically Labeled Pairs)\nThe seed lexicon matches (1) the latter event but (2) not the former event, and (3) their discourse relation type is Cause or Concession. If the discourse relation type is Cause, the former event is given the same score as the latter. Likewise, if the discourse relation type is Concession, the former event is given the opposite of the latter's score. They are used as reference scores during training.\nProposed Method ::: Discourse Relation-Based Event Pairs ::: CA (Cause Pairs)\nThe seed lexicon matches neither the former nor the latter event, and their discourse relation type is Cause. We assume the two events have the same polarities.\nProposed Method ::: Discourse Relation-Based Event Pairs ::: CO (Concession Pairs)\nThe seed lexicon matches neither the former nor the latter event, and their discourse relation type is Concession. We assume the two events have the reversed polarities.\nProposed Method ::: Loss Functions\nUsing AL, CA, and CO data, we optimize the parameters of the polarity function $p(x)$. We define a loss function for each of the three types of event pairs and sum up the multiple loss functions.\nWe use mean squared error to construct loss functions. For the AL data, the loss function is defined as:\nwhere $x_{i1}$ and $x_{i2}$ are the $i$-th pair of the AL data. $r_{i1}$ and $r_{i2}$ are the automatically-assigned scores of $x_{i1}$ and $x_{i2}$, respectively. $N_{\\rm AL}$ is the total number of AL pairs, and $\\lambda _{\\rm AL}$ is a hyperparameter.\nFor the CA data, the loss function is defined as:\n$y_{i1}$ and $y_{i2}$ are the $i$-th pair of the CA pairs. $N_{\\rm CA}$ is the total number of CA pairs. $\\lambda _{\\rm CA}$ and $\\mu $ are hyperparameters. The first term makes the scores of the two events closer while the second term prevents the scores from shrinking to zero.\nThe loss function for the CO data is defined analogously:\nThe difference is that the first term makes the scores of the two events distant from each other.\nExperiments\nExperiments ::: Dataset\nExperiments ::: Dataset ::: AL, CA, and CO\nAs a raw corpus, we used a Japanese web corpus that was compiled through the procedures proposed by BIBREF13. To extract event pairs tagged with discourse relations, we used the Japanese dependency parser KNP and in-house postprocessing scripts BIBREF14. KNP used hand-written rules to segment each sentence into what we conventionally called clauses (mostly consecutive text chunks), each of which contained one main predicate. KNP also identified the discourse relations of event pairs if explicit discourse connectives BIBREF4 such as \u201c\u306e\u3067\u201d (because) and \u201c\u306e\u306b\u201d (in spite of) were present. We treated Cause/Reason (\u539f\u56e0\u30fb\u7406\u7531) and Condition (\u6761\u4ef6) in the original tagset BIBREF15 as Cause and Concession (\u9006\u63a5) as Concession, respectively. Here is an example of event pair extraction.\n. \u91cd\u5927\u306a\u5931\u6557\u3092\u72af\u3057\u305f\u306e\u3067\u3001\u4ed5\u4e8b\u3092\u30af\u30d3\u306b\u306a\u3063\u305f\u3002\nBecause [I] made a serious mistake, [I] got fired.\nFrom this sentence, we extracted the event pair of \u201c\u91cd\u5927\u306a\u5931\u6557\u3092\u72af\u3059\u201d ([I] make a serious mistake) and \u201c\u4ed5\u4e8b\u3092\u30af\u30d3\u306b\u306a\u308b\u201d ([I] get fired), and tagged it with Cause.\nWe constructed our seed lexicon consisting of 15 positive words and 15 negative words, as shown in Section SECREF27. From the corpus of about 100 million sentences, we obtained 1.4 millions event pairs for AL, 41 millions for CA, and 6 millions for CO. We randomly selected subsets of AL event pairs such that positive and negative latter events were equal in size. We also sampled event pairs for each of CA and CO such that it was five times larger than AL. The results are shown in Table TABREF16.\nExperiments ::: Dataset ::: ACP (ACP Corpus)\nWe used the latest version of the ACP Corpus BIBREF12 for evaluation. It was used for (semi-)supervised training as well. Extracted from Japanese websites using HTML layouts and linguistic patterns, the dataset covered various genres. For example, the following two sentences were labeled positive and negative, respectively:\n. \u4f5c\u696d\u304c\u697d\u3060\u3002\nThe work is easy.\n. \u99d0\u8eca\u5834\u304c\u306a\u3044\u3002\nThere is no parking lot.\nAlthough the ACP corpus was originally constructed in the context of sentiment analysis, we found that it could roughly be regarded as a collection of affective events. We parsed each sentence and extracted the last clause in it. The train/dev/test split of the data is shown in Table TABREF19.\nThe objective function for supervised training is:\nwhere $v_i$ is the $i$-th event, $R_i$ is the reference score of $v_i$, and $N_{\\rm ACP}$ is the number of the events of the ACP Corpus.\nTo optimize the hyperparameters, we used the dev set of the ACP Corpus. For the evaluation, we used the test set of the ACP Corpus. The model output was classified as positive if $p(x) > 0$ and negative if $p(x) \\le 0$.\nExperiments ::: Model Configurations\nAs for ${\\rm Encoder}$, we compared two types of neural networks: BiGRU and BERT. GRU BIBREF16 is a recurrent neural network sequence encoder. BiGRU reads an input sequence forward and backward and the output is the concatenation of the final forward and backward hidden states.\nBERT BIBREF17 is a pre-trained multi-layer bidirectional Transformer BIBREF18 encoder. Its output is the final hidden state corresponding to the special classification tag ([CLS]). For the details of ${\\rm Encoder}$, see Sections SECREF30.\nWe trained the model with the following four combinations of the datasets: AL, AL+CA+CO (two proposed models), ACP (supervised), and ACP+AL+CA+CO (semi-supervised). The corresponding objective functions were: $\\mathcal {L}_{\\rm AL}$, $\\mathcal {L}_{\\rm AL} + \\mathcal {L}_{\\rm CA} + \\mathcal {L}_{\\rm CO}$, $\\mathcal {L}_{\\rm ACP}$, and $\\mathcal {L}_{\\rm ACP} + \\mathcal {L}_{\\rm AL} + \\mathcal {L}_{\\rm CA} + \\mathcal {L}_{\\rm CO}$.\nExperiments ::: Results and Discussion\nTable TABREF23 shows accuracy. As the Random baseline suggests, positive and negative labels were distributed evenly. The Random+Seed baseline made use of the seed lexicon and output the corresponding label (or the reverse of it for negation) if the event's predicate is in the seed lexicon. We can see that the seed lexicon itself had practically no impact on prediction.\nThe models in the top block performed considerably better than the random baselines. The performance gaps with their (semi-)supervised counterparts, shown in the middle block, were less than 7%. This demonstrates the effectiveness of discourse relation-based label propagation.\nComparing the model variants, we obtained the highest score with the BiGRU encoder trained with the AL+CA+CO dataset. BERT was competitive but its performance went down if CA and CO were used in addition to AL. We conjecture that BERT was more sensitive to noises found more frequently in CA and CO.\nContrary to our expectations, supervised models (ACP) outperformed semi-supervised models (ACP+AL+CA+CO). This suggests that the training set of 0.6 million events is sufficiently large for training the models. For comparison, we trained the models with a subset (6,000 events) of the ACP dataset. As the results shown in Table TABREF24 demonstrate, our method is effective when labeled data are small.\nThe result of hyperparameter optimization for the BiGRU encoder was as follows:\nAs the CA and CO pairs were equal in size (Table TABREF16), $\\lambda _{\\rm CA}$ and $\\lambda _{\\rm CO}$ were comparable values. $\\lambda _{\\rm CA}$ was about one-third of $\\lambda _{\\rm CO}$, and this indicated that the CA pairs were noisier than the CO pairs. A major type of CA pairs that violates our assumption was in the form of \u201c$\\textit {problem}_{\\text{negative}}$ causes $\\textit {solution}_{\\text{positive}}$\u201d:\n. (\u60aa\u3044\u3068\u3053\u308d\u304c\u3042\u308b, \u3088\u304f\u306a\u308b\u3088\u3046\u306b\u52aa\u529b\u3059\u308b)\n(there is a bad point, [I] try to improve [it])\nThe polarities of the two events were reversed in spite of the Cause relation, and this lowered the value of $\\lambda _{\\rm CA}$.\nSome examples of model outputs are shown in Table TABREF26. The first two examples suggest that our model successfully learned negation without explicit supervision. Similarly, the next two examples differ only in voice but the model correctly recognized that they had opposite polarities. The last two examples share the predicate \u201c\u843d\u3068\u3059\" (drop) and only the objects are different. The second event \u201c\u80a9\u3092\u843d\u3068\u3059\" (lit. drop one's shoulders) is an idiom that expresses a disappointed feeling. The examples demonstrate that our model correctly learned non-compositional expressions.\nConclusion\nIn this paper, we proposed to use discourse relations to effectively propagate polarities of affective events from seeds. Experiments show that, even with a minimal amount of supervision, the proposed method performed well.\nAlthough event pairs linked by discourse analysis are shown to be useful, they nevertheless contain noises. Adding linguistically-motivated filtering rules would help improve the performance.\nAcknowledgments\nWe thank Nobuhiro Kaji for providing the ACP Corpus and Hirokazu Kiyomaru and Yudai Kishimoto for their help in extracting event pairs. This work was partially supported by Yahoo! Japan Corporation.\nAppendices ::: Seed Lexicon ::: Positive Words\n\u559c\u3076 (rejoice), \u5b09\u3057\u3044 (be glad), \u697d\u3057\u3044 (be pleasant), \u5e78\u305b (be happy), \u611f\u52d5 (be impressed), \u8208\u596e (be excited), \u61d0\u304b\u3057\u3044 (feel nostalgic), \u597d\u304d (like), \u5c0a\u656c (respect), \u5b89\u5fc3 (be relieved), \u611f\u5fc3 (admire), \u843d\u3061\u7740\u304f (be calm), \u6e80\u8db3 (be satisfied), \u7652\u3055\u308c\u308b (be healed), and \u30b9\u30c3\u30ad\u30ea (be refreshed).\nAppendices ::: Seed Lexicon ::: Negative Words\n\u6012\u308b (get angry), \u60b2\u3057\u3044 (be sad), \u5bc2\u3057\u3044 (be lonely), \u6016\u3044 (be scared), \u4e0d\u5b89 (feel anxious), \u6065\u305a\u304b\u3057\u3044 (be embarrassed), \u5acc (hate), \u843d\u3061\u8fbc\u3080 (feel down), \u9000\u5c48 (be bored), \u7d76\u671b (feel hopeless), \u8f9b\u3044 (have a hard time), \u56f0\u308b (have trouble), \u6182\u9b31 (be depressed), \u5fc3\u914d (be worried), and \u60c5\u3051\u306a\u3044 (be sorry).\nAppendices ::: Settings of Encoder ::: BiGRU\nThe dimension of the embedding layer was 256. The embedding layer was initialized with the word embeddings pretrained using the Web corpus. The input sentences were segmented into words by the morphological analyzer Juman++. The vocabulary size was 100,000. The number of hidden layers was 2. The dimension of hidden units was 256. The optimizer was Momentum SGD BIBREF21. The mini-batch size was 1024. We ran 100 epochs and selected the snapshot that achieved the highest score for the dev set.\nAppendices ::: Settings of Encoder ::: BERT\nWe used a Japanese BERT model pretrained with Japanese Wikipedia. The input sentences were segmented into words by Juman++, and words were broken into subwords by applying BPE BIBREF20. The vocabulary size was 32,000. The maximum length of an input sequence was 128. The number of hidden layers was 12. The dimension of hidden units was 768. The number of self-attention heads was 12. The optimizer was Adam BIBREF19. The mini-batch size was 32. We ran 1 epoch.", "source": "scientific_qa", "evaluation": "f1"}
{"instructions": ["What data is the language model pretrained on?", "What baselines is the proposed model compared against?", "How is the clinical text structuring task defined?", "How is the clinical text structuring task defined?", "What are the specific tasks being unified?", "Is all text in this dataset a question, or are there unrelated sentences in between questions?", "How many questions are in the dataset?", "How they introduce domain-specific features into pre-trained language model?", "How big is QA-CTS task dataset?", "What are strong baseline models in specific tasks?"], "outputs": ["Chinese general corpus", "BERT-Base, QANet", "Clinical text structuring (CTS) is a critical task for fetching medical research data from electronic health records (EHRs), where structural patient medical data, such as whether the patient has specific symptoms, diseases, or what the tumor size is, how far from the tumor is cut at during the surgery, or what the specific laboratory test result is, are obtained., Unlike the traditional CTS task, our QA-CTS task aims to discover the most related text from original paragraph text. ", "CTS is extracting structural data from medical research data (unstructured). Authors define QA-CTS task that aims to discover most related text from original text.", " three types of questions, namely tumor size, proximal resection margin and distal resection margin", "the dataset consists of pathology reports including sentences and questions and answers about tumor size and resection margins so it does include additional sentences ", "2,714 ", "integrate clinical named entity information into pre-trained language model", "17,833 sentences, 826,987 characters and 2,714 question-answer pairs", "state-of-the-art question answering models (i.e. QANet BIBREF39) and BERT-Base BIBREF26"], "input": "Introduction\nClinical text structuring (CTS) is a critical task for fetching medical research data from electronic health records (EHRs), where structural patient medical data, such as whether the patient has specific symptoms, diseases, or what the tumor size is, how far from the tumor is cut at during the surgery, or what the specific laboratory test result is, are obtained. It is important to extract structured data from clinical text because bio-medical systems or bio-medical researches greatly rely on structured data but they cannot obtain them directly. In addition, clinical text often contains abundant healthcare information. CTS is able to provide large-scale extracted structured data for enormous down-stream clinical researches.\nHowever, end-to-end CTS is a very challenging task. Different CTS tasks often have non-uniform output formats, such as specific-class classifications (e.g. tumor stage), strings in the original text (e.g. result for a laboratory test) and inferred values from part of the original text (e.g. calculated tumor size). Researchers have to construct different models for it, which is already costly, and hence it calls for a lot of labeled data for each model. Moreover, labeling necessary amount of data for training neural network requires expensive labor cost. To handle it, researchers turn to some rule-based structuring methods which often have lower labor cost.\nTraditionally, CTS tasks can be addressed by rule and dictionary based methods BIBREF0, BIBREF1, BIBREF2, task-specific end-to-end methods BIBREF3, BIBREF4, BIBREF5, BIBREF6 and pipeline methods BIBREF7, BIBREF8, BIBREF9. Rule and dictionary based methods suffer from costly human-designed extraction rules, while task-specific end-to-end methods have non-uniform output formats and require task-specific training dataset. Pipeline methods break down the entire process into several pieces which improves the performance and generality. However, when the pipeline depth grows, error propagation will have a greater impact on the performance.\nTo reduce the pipeline depth and break the barrier of non-uniform output formats, we present a question answering based clinical text structuring (QA-CTS) task (see Fig. FIGREF1). Unlike the traditional CTS task, our QA-CTS task aims to discover the most related text from original paragraph text. For some cases, it is already the final answer in deed (e.g., extracting sub-string). While for other cases, it needs several steps to obtain the final answer, such as entity names conversion and negative words recognition. Our presented QA-CTS task unifies the output format of the traditional CTS task and make the training data shareable, thus enriching the training data. The main contributions of this work can be summarized as follows.\nWe first present a question answering based clinical text structuring (QA-CTS) task, which unifies different specific tasks and make dataset shareable. We also propose an effective model to integrate clinical named entity information into pre-trained language model.\nExperimental results show that QA-CTS task leads to significant improvement due to shared dataset. Our proposed model also achieves significantly better performance than the strong baseline methods. In addition, we also show that two-stage training mechanism has a great improvement on QA-CTS task.\nThe rest of the paper is organized as follows. We briefly review the related work on clinical text structuring in Section SECREF2. Then, we present question answer based clinical text structuring task in Section SECREF3. In Section SECREF4, we present an effective model for this task. Section SECREF5 is devoted to computational studies and several investigations on the key issues of our proposed model. Finally, conclusions are given in Section SECREF6.\nRelated Work ::: Clinical Text Structuring\nClinical text structuring is a final problem which is highly related to practical applications. Most of existing studies are case-by-case. Few of them are developed for the general purpose structuring task. These studies can be roughly divided into three categories: rule and dictionary based methods, task-specific end-to-end methods and pipeline methods.\nRule and dictionary based methods BIBREF0, BIBREF1, BIBREF2 rely extremely on heuristics and handcrafted extraction rules which is more of an art than a science and incurring extensive trial-and-error experiments. Fukuda et al. BIBREF0 identified protein names from biological papers by dictionaries and several features of protein names. Wang et al. BIBREF1 developed some linguistic rules (i.e. normalised/expanded term matching and substring term matching) to map specific terminology to SNOMED CT. Song et al. BIBREF2 proposed a hybrid dictionary-based bio-entity extraction technique and expands the bio-entity dictionary by combining different data sources and improves the recall rate through the shortest path edit distance algorithm. This kind of approach features its interpretability and easy modifiability. However, with the increase of the rule amount, supplementing new rules to existing system will turn to be a rule disaster.\nTask-specific end-to-end methods BIBREF3, BIBREF4 use large amount of data to automatically model the specific task. Topaz et al. BIBREF3 constructed an automated wound information identification model with five output. Tan et al. BIBREF4 identified patients undergoing radical cystectomy for bladder cancer. Although they achieved good performance, none of their models could be used to another task due to output format difference. This makes building a new model for a new task a costly job.\nPipeline methods BIBREF7, BIBREF8, BIBREF9 break down the entire task into several basic natural language processing tasks. Bill et al. BIBREF7 focused on attributes extraction which mainly relied on dependency parsing and named entity recognition BIBREF10, BIBREF11, BIBREF12. Meanwhile, Fonferko et al. BIBREF9 used more components like noun phrase chunking BIBREF13, BIBREF14, BIBREF15, part-of-speech tagging BIBREF16, BIBREF17, BIBREF18, sentence splitter, named entity linking BIBREF19, BIBREF20, BIBREF21, relation extraction BIBREF22, BIBREF23. This kind of method focus on language itself, so it can handle tasks more general. However, as the depth of pipeline grows, it is obvious that error propagation will be more and more serious. In contrary, using less components to decrease the pipeline depth will lead to a poor performance. So the upper limit of this method depends mainly on the worst component.\nRelated Work ::: Pre-trained Language Model\nRecently, some works focused on pre-trained language representation models to capture language information from text and then utilizing the information to improve the performance of specific natural language processing tasks BIBREF24, BIBREF25, BIBREF26, BIBREF27 which makes language model a shared model to all natural language processing tasks. Radford et al. BIBREF24 proposed a framework for fine-tuning pre-trained language model. Peters et al. BIBREF25 proposed ELMo which concatenates forward and backward language models in a shallow manner. Devlin et al. BIBREF26 used bidirectional Transformers to model deep interactions between the two directions. Yang et al. BIBREF27 replaced the fixed forward or backward factorization order with all possible permutations of the factorization order and avoided using the [MASK] tag which causes pretrain-finetune discrepancy that BERT is subject to.\nThe main motivation of introducing pre-trained language model is to solve the shortage of labeled data and polysemy problem. Although polysemy problem is not a common phenomenon in biomedical domain, shortage of labeled data is always a non-trivial problem. Lee et al. BIBREF28 applied BERT on large-scale biomedical unannotated data and achieved improvement on biomedical named entity recognition, relation extraction and question answering. Kim et al. BIBREF29 adapted BioBERT into multi-type named entity recognition and discovered new entities. Both of them demonstrates the usefulness of introducing pre-trained language model into biomedical domain.\nQuestion Answering based Clinical Text Structuring\nGiven a sequence of paragraph text $X=<x_1, x_2, ..., x_n>$, clinical text structuring (CTS) can be regarded to extract or generate a key-value pair where key $Q$ is typically a query term such as proximal resection margin and value $V$ is a result of query term $Q$ according to the paragraph text $X$.\nGenerally, researchers solve CTS problem in two steps. Firstly, the answer-related text is pick out. And then several steps such as entity names conversion and negative words recognition are deployed to generate the final answer. While final answer varies from task to task, which truly causes non-uniform output formats, finding the answer-related text is a common action among all tasks. Traditional methods regard both the steps as a whole. In this paper, we focus on finding the answer-related substring $Xs = <X_i, X_i+1, X_i+2, ... X_j> (1 <= i < j <= n)$ from paragraph text $X$. For example, given sentence UTF8gkai\u201c\u8fdc\u7aef\u80c3\u5207\u9664\u6807\u672c\uff1a\u5c0f\u5f2f\u957f11.5cm\uff0c\u5927\u5f2f\u957f17.0cm\u3002\u8ddd\u4e0a\u5207\u7aef6.0cm\u3001\u4e0b\u5207\u7aef8.0cm\" (Distal gastrectomy specimen: measuring 11.5cm in length along the lesser curvature, 17.0cm in length along the greater curvature; 6.0cm from the proximal resection margin, and 8.0cm from the distal resection margin) and query UTF8gkai\u201c\u4e0a\u5207\u7f18\u8ddd\u79bb\"(proximal resection margin), the answer should be 6.0cm which is located in original text from index 32 to 37. With such definition, it unifies the output format of CTS tasks and therefore make the training data shareable, in order to reduce the training data quantity requirement.\nSince BERT BIBREF26 has already demonstrated the usefulness of shared model, we suppose extracting commonality of this problem and unifying the output format will make the model more powerful than dedicated model and meanwhile, for a specific clinical task, use the data for other tasks to supplement the training data.\nThe Proposed Model for QA-CTS Task\nIn this section, we present an effective model for the question answering based clinical text structuring (QA-CTS). As shown in Fig. FIGREF8, paragraph text $X$ is first passed to a clinical named entity recognition (CNER) model BIBREF12 to capture named entity information and obtain one-hot CNER output tagging sequence for query text $I_{nq}$ and paragraph text $I_{nt}$ with BIEOS (Begin, Inside, End, Outside, Single) tag scheme. $I_{nq}$ and $I_{nt}$ are then integrated together into $I_n$. Meanwhile, the paragraph text $X$ and query text $Q$ are organized and passed to contextualized representation model which is pre-trained language model BERT BIBREF26 here to obtain the contextualized representation vector $V_s$ of both text and query. Afterwards, $V_s$ and $I_n$ are integrated together and fed into a feed forward network to calculate the start and end index of answer-related text. Here we define this calculation problem as a classification for each word to be the start or end word.\nThe Proposed Model for QA-CTS Task ::: Contextualized Representation of Sentence Text and Query Text\nFor any clinical free-text paragraph $X$ and query $Q$, contextualized representation is to generate the encoded vector of both of them. Here we use pre-trained language model BERT-base BIBREF26 model to capture contextual information.\nThe text input is constructed as `[CLS] $Q$ [SEP] $X$ [SEP]'. For Chinese sentence, each word in this input will be mapped to a pre-trained embedding $e_i$. To tell the model $Q$ and $X$ is two different sentence, a sentence type input is generated which is a binary label sequence to denote what sentence each character in the input belongs to. Positional encoding and mask matrix is also constructed automatically to bring in absolute position information and eliminate the impact of zero padding respectively. Then a hidden vector $V_s$ which contains both query and text information is generated through BERT-base model.\nThe Proposed Model for QA-CTS Task ::: Clinical Named Entity Information\nSince BERT is trained on general corpus, its performance on biomedical domain can be improved by introducing biomedical domain-specific features. In this paper, we introduce clinical named entity information into the model.\nThe CNER task aims to identify and classify important clinical terms such as diseases, symptoms, treatments, exams, and body parts from Chinese EHRs. It can be regarded as a sequence labeling task. A CNER model typically outputs a sequence of tags. Each character of the original sentence will be tagged a label following a tag scheme. In this paper we recognize the entities by the model of our previous work BIBREF12 but trained on another corpus which has 44 entity types including operations, numbers, unit words, examinations, symptoms, negative words, etc. An illustrative example of named entity information sequence is demonstrated in Table TABREF2. In Table TABREF2, UTF8gkai\u201c\u8fdc\u7aef\u80c3\u5207\u9664\" is tagged as an operation, `11.5' is a number word and `cm' is an unit word. The named entity tag sequence is organized in one-hot type. We denote the sequence for clinical sentence and query term as $I_{nt}$ and $I_{nq}$, respectively.\nThe Proposed Model for QA-CTS Task ::: Integration Method\nThere are two ways to integrate two named entity information vectors $I_{nt}$ and $I_{nq}$ or hidden contextualized representation $V_s$ and named entity information $I_n$, where $I_n = [I_{nt}; I_{nq}]$. The first one is to concatenate them together because they have sequence output with a common dimension. The second one is to transform them into a new hidden representation. For the concatenation method, the integrated representation is described as follows.\nWhile for the transformation method, we use multi-head attention BIBREF30 to encode the two vectors. It can be defined as follows where $h$ is the number of heads and $W_o$ is used to projects back the dimension of concatenated matrix.\n$Attention$ denotes the traditional attention and it can be defined as follows.\nwhere $d_k$ is the length of hidden vector.\nThe Proposed Model for QA-CTS Task ::: Final Prediction\nThe final step is to use integrated representation $H_i$ to predict the start and end index of answer-related text. Here we define this calculation problem as a classification for each word to be the start or end word. We use a feed forward network (FFN) to compress and calculate the score of each word $H_f$ which makes the dimension to $\\left\\langle l_s, 2\\right\\rangle $ where $l_s$ denotes the length of sequence.\nThen we permute the two dimensions for softmax calculation. The calculation process of loss function can be defined as followed.\nwhere $O_s = softmax(permute(H_f)_0)$ denotes the probability score of each word to be the start word and similarly $O_e = softmax(permute(H_f)_1)$ denotes the end. $y_s$ and $y_e$ denotes the true answer of the output for start word and end word respectively.\nThe Proposed Model for QA-CTS Task ::: Two-Stage Training Mechanism\nTwo-stage training mechanism is previously applied on bilinear model in fine-grained visual recognition BIBREF31, BIBREF32, BIBREF33. Two CNNs are deployed in the model. One is trained at first for coarse-graind features while freezing the parameter of the other. Then unfreeze the other one and train the entire model in a low learning rate for fetching fine-grained features.\nInspired by this and due to the large amount of parameters in BERT model, to speed up the training process, we fine tune the BERT model with new prediction layer first to achieve a better contextualized representation performance. Then we deploy the proposed model and load the fine tuned BERT weights, attach named entity information layers and retrain the model.\nExperimental Studies\nIn this section, we devote to experimentally evaluating our proposed task and approach. The best results in tables are in bold.\nExperimental Studies ::: Dataset and Evaluation Metrics\nOur dataset is annotated based on Chinese pathology reports provided by the Department of Gastrointestinal Surgery, Ruijin Hospital. It contains 17,833 sentences, 826,987 characters and 2,714 question-answer pairs. All question-answer pairs are annotated and reviewed by four clinicians with three types of questions, namely tumor size, proximal resection margin and distal resection margin. These annotated instances have been partitioned into 1,899 training instances (12,412 sentences) and 815 test instances (5,421 sentences). Each instance has one or several sentences. Detailed statistics of different types of entities are listed in Table TABREF20.\nIn the following experiments, two widely-used performance measures (i.e., EM-score BIBREF34 and (macro-averaged) F$_1$-score BIBREF35) are used to evaluate the methods. The Exact Match (EM-score) metric measures the percentage of predictions that match any one of the ground truth answers exactly. The F$_1$-score metric is a looser metric measures the average overlap between the prediction and ground truth answer.\nExperimental Studies ::: Experimental Settings\nTo implement deep neural network models, we utilize the Keras library BIBREF36 with TensorFlow BIBREF37 backend. Each model is run on a single NVIDIA GeForce GTX 1080 Ti GPU. The models are trained by Adam optimization algorithm BIBREF38 whose parameters are the same as the default settings except for learning rate set to $5\\times 10^{-5}$. Batch size is set to 3 or 4 due to the lack of graphical memory. We select BERT-base as the pre-trained language model in this paper. Due to the high cost of pre-training BERT language model, we directly adopt parameters pre-trained by Google in Chinese general corpus. The named entity recognition is applied on both pathology report texts and query texts.\nExperimental Studies ::: Comparison with State-of-the-art Methods\nSince BERT has already achieved the state-of-the-art performance of question-answering, in this section we compare our proposed model with state-of-the-art question answering models (i.e. QANet BIBREF39) and BERT-Base BIBREF26. As BERT has two versions: BERT-Base and BERT-Large, due to the lack of computational resource, we can only compare with BERT-Base model instead of BERT-Large. Prediction layer is attached at the end of the original BERT-Base model and we fine tune it on our dataset. In this section, the named entity integration method is chosen to pure concatenation (Concatenate the named entity information on pathology report text and query text first and then concatenate contextualized representation and concatenated named entity information). Comparative results are summarized in Table TABREF23.\nTable TABREF23 indicates that our proposed model achieved the best performance both in EM-score and F$_1$-score with EM-score of 91.84% and F$_1$-score of 93.75%. QANet outperformed BERT-Base with 3.56% score in F$_1$-score but underperformed it with 0.75% score in EM-score. Compared with BERT-Base, our model led to a 5.64% performance improvement in EM-score and 3.69% in F$_1$-score. Although our model didn't outperform much with QANet in F$_1$-score (only 0.13%), our model significantly outperformed it with 6.39% score in EM-score.\nExperimental Studies ::: Ablation Analysis\nTo further investigate the effects of named entity information and two-stage training mechanism for our model, we apply ablation analysis to see the improvement brought by each of them, where $\\times $ refers to removing that part from our model.\nAs demonstrated in Table TABREF25, with named entity information enabled, two-stage training mechanism improved the result by 4.36% in EM-score and 3.8% in F$_1$-score. Without two-stage training mechanism, named entity information led to an improvement by 1.28% in EM-score but it also led to a weak deterioration by 0.12% in F$_1$-score. With both of them enabled, our proposed model achieved a 5.64% score improvement in EM-score and a 3.69% score improvement in F$_1$-score. The experimental results show that both named entity information and two-stage training mechanism are helpful to our model.\nExperimental Studies ::: Comparisons Between Two Integration Methods\nThere are two methods to integrate named entity information into existing model, we experimentally compare these two integration methods. As named entity recognition has been applied on both pathology report text and query text, there will be two integration here. One is for two named entity information and the other is for contextualized representation and integrated named entity information. For multi-head attention BIBREF30, we set heads number $h = 16$ with 256-dimension hidden vector size for each head.\nFrom Table TABREF27, we can observe that applying concatenation on both periods achieved the best performance on both EM-score and F$_1$-score. Unfortunately, applying multi-head attention on both period one and period two can not reach convergence in our experiments. This probably because it makes the model too complex to train. The difference on other two methods are the order of concatenation and multi-head attention. Applying multi-head attention on two named entity information $I_{nt}$ and $I_{nq}$ first achieved a better performance with 89.87% in EM-score and 92.88% in F$_1$-score. Applying Concatenation first can only achieve 80.74% in EM-score and 84.42% in F$_1$-score. This is probably due to the processing depth of hidden vectors and dataset size. BERT's output has been modified after many layers but named entity information representation is very close to input. With big amount of parameters in multi-head attention, it requires massive training to find out the optimal parameters. However, our dataset is significantly smaller than what pre-trained BERT uses. This probably can also explain why applying multi-head attention method on both periods can not converge.\nAlthough Table TABREF27 shows the best integration method is concatenation, multi-head attention still has great potential. Due to the lack of computational resources, our experiment fixed the head number and hidden vector size. However, tuning these hyper parameters may have impact on the result. Tuning integration method and try to utilize larger datasets may give help to improving the performance.\nExperimental Studies ::: Data Integration Analysis\nTo investigate how shared task and shared model can benefit, we split our dataset by query types, train our proposed model with different datasets and demonstrate their performance on different datasets. Firstly, we investigate the performance on model without two-stage training and named entity information.\nAs indicated in Table TABREF30, The model trained by mixed data outperforms 2 of the 3 original tasks in EM-score with 81.55% for proximal resection margin and 86.85% for distal resection margin. The performance on tumor size declined by 1.57% score in EM-score and 3.14% score in F$_1$-score but they were still above 90%. 0.69% and 0.37% score improvement in EM-score was brought by shared model for proximal and distal resection margin prediction. Meanwhile F$_1$-score for those two tasks declined 3.11% and 0.77% score.\nThen we investigate the performance on model with two-stage training and named entity information. In this experiment, pre-training process only use the specific dataset not the mixed data. From Table TABREF31 we can observe that the performance on proximal and distal resection margin achieved the best performance on both EM-score and F$_1$-score. Compared with Table TABREF30, the best performance on proximal resection margin improved by 6.9% in EM-score and 7.94% in F$_1$-score. Meanwhile, the best performance on distal resection margin improved by 5.56% in EM-score and 6.32% in F$_1$-score. Other performances also usually improved a lot. This proves the usefulness of two-stage training and named entity information as well.\nLastly, we fine tune the model for each task with a pre-trained parameter. Table TABREF32 summarizes the result. (Add some explanations for the Table TABREF32). Comparing Table TABREF32 with Table TABREF31, using mixed-data pre-trained parameters can significantly improve the model performance than task-specific data trained model. Except tumor size, the result was improved by 0.52% score in EM-score, 1.39% score in F$_1$-score for proximal resection margin and 2.6% score in EM-score, 2.96% score in F$_1$-score for distal resection margin. This proves mixed-data pre-trained parameters can lead to a great benefit for specific task. Meanwhile, the model performance on other tasks which are not trained in the final stage was also improved from around 0 to 60 or 70 percent. This proves that there is commonality between different tasks and our proposed QA-CTS task make this learnable. In conclusion, to achieve the best performance for a specific dataset, pre-training the model in multiple datasets and then fine tuning the model on the specific dataset is the best way.\nConclusion\nIn this paper, we present a question answering based clinical text structuring (QA-CTS) task, which unifies different clinical text structuring tasks and utilize different datasets. A novel model is also proposed to integrate named entity information into a pre-trained language model and adapt it to QA-CTS task. Initially, sequential results of named entity recognition on both paragraph and query texts are integrated together. Contextualized representation on both paragraph and query texts are transformed by a pre-trained language model. Then, the integrated named entity information and contextualized representation are integrated together and fed into a feed forward network for final prediction. Experimental results on real-world dataset demonstrate that our proposed model competes favorably with strong baseline models in all three specific tasks. The shared task and shared model introduced by QA-CTS task has also been proved to be useful for improving the performance on most of the task-specific datasets. In conclusion, the best way to achieve the best performance for a specific dataset is to pre-train the model in multiple datasets and then fine tune it on the specific dataset.\nAcknowledgment\nWe would like to thank Ting Li and Xizhou Hong (Ruijin Hospital) who have helped us very much in data fetching and data cleansing. This work is supported by the National Key R&D Program of China for \u201cPrecision Medical Research\" (No. 2018YFC0910500).", "source": "scientific_qa", "evaluation": "f1"}
{"instructions": ["What is the previous work's model?", "What dataset is used?", "How big is the dataset?", "How big is the dataset?", "How is the dataset collected?", "Was each text augmentation technique experimented individually?", "What models do previous work use?", "Does the dataset contain content from various social media platforms?"], "outputs": ["Ternary Trans-CNN", "HEOT , A labelled dataset for a corresponding english tweets", "3189 rows of text messages", "Resulting dataset was 7934 messages for train and 700 messages for test.", "A labelled dataset for a corresponding english tweets were also obtained from a study conducted by Davidson et al, HEOT obtained from one of the past studies done by Mathur et al", "No", "Ternary Trans-CNN , Hybrid multi-channel CNN and LSTM", "No"], "input": "Introduction\nHinglish is a linguistic blend of Hindi (very widely spoken language in India) and English (an associate language of urban areas) and is spoken by upwards of 350 million people in India. While the name is based on the Hindi language, it does not refer exclusively to Hindi, but is used in India, with English words blending with Punjabi, Gujarati, Marathi and Hindi. Sometimes, though rarely, Hinglish is used to refer to Hindi written in English script and mixing with English words or phrases. This makes analyzing the language very interesting. Its rampant usage in social media like Twitter, Facebook, Online blogs and reviews has also led to its usage in delivering hate and abuses in similar platforms. We aim to find such content in the social media focusing on the tweets. Hypothetically, if we can classify such tweets, we might be able to detect them and isolate them for further analysis before it reaches public. This will a great application of AI to the social cause and thus is motivating. An example of a simple, non offensive message written in Hinglish could be:\n\"Why do you waste your time with <redacted content>. Aapna ghar sambhalta nahi(<redacted content>). Chale dusro ko basane..!!\"\nThe second part of the above sentence is written in Hindi while the first part is in English. Second part calls for an action to a person to bring order to his/her home before trying to settle others.\nIntroduction ::: Modeling challenges\nFrom the modeling perspective there are couple of challenges introduced by the language and the labelled dataset. Generally, Hinglish follows largely fuzzy set of rules which evolves and is dependent upon the users preference. It doesn't have any formal definitions and thus the rules of usage are ambiguous. Thus, when used by different users the text produced may differ. Overall the challenges posed by this problem are:\nGeographical variation: Depending upon the geography of origination, the content may be be highly influenced by the underlying region.\nLanguage and phonetics variation: Based on a census in 2001, India has 122 major languages and 1599 other languages. The use of Hindi and English in a code switched setting is highly influenced by these language.\nNo grammar rules: Hinglish has no fixed set of grammar rules. The rules are inspired from both Hindi and English and when mixed with slur and slang produce large variation.\nSpelling variation: There is no agreement on the spellings of the words which are mixed with English. For example to express love, a code mixed spelling, specially when used social platforms might be pyaar, pyar or pyr.\nDataset: Based on some earlier work, only available labelled dataset had 3189 rows of text messages of average length of 116 words and with a range of 1, 1295. Prior work addresses this concern by using Transfer Learning on an architecture learnt on about 14,500 messages with an accuracy of 83.90. We addressed this concern using data augmentation techniques applied on text data.\nRelated Work ::: Transfer learning based approaches\nMathur et al. in their paper for detecting offensive tweets proposed a Ternary Trans-CNN model where they train a model architecture comprising of 3 layers of Convolution 1D having filter sizes of 15, 12 and 10 and kernel size of 3 followed by 2 dense fully connected layer of size 64 and 3. The first dense FC layer has ReLU activation while the last Dense layer had Softmax activation. They were able to train this network on a parallel English dataset provided by Davidson et al. The authors were able to achieve Accuracy of 83.9%, Precision of 80.2%, Recall of 69.8%.\nThe approach looked promising given that the dataset was merely 3189 sentences divided into three categories and thus we replicated the experiment but failed to replicate the results. The results were poor than what the original authors achieved. But, most of the model hyper-parameter choices where inspired from this work.\nRelated Work ::: Hybrid models\nIn another localized setting of Vietnamese language, Nguyen et al. in 2017 proposed a Hybrid multi-channel CNN and LSTM model where they build feature maps for Vietnamese language using CNN to capture shorterm dependencies and LSTM to capture long term dependencies and concatenate both these feature sets to learn a unified set of features on the messages. These concatenated feature vectors are then sent to a few fully connected layers. They achieved an accuracy rate of 87.3% with this architecture.\nDataset and Features\nWe used dataset, HEOT obtained from one of the past studies done by Mathur et al. where they annotated a set of cleaned tweets obtained from twitter for the conversations happening in Indian subcontinent. A labelled dataset for a corresponding english tweets were also obtained from a study conducted by Davidson et al. This dataset was important to employ Transfer Learning to our task since the number of labeled dataset was very small. Basic summary and examples of the data from the dataset are below:\nDataset and Features ::: Challenges\nThe obtained data set had many challenges and thus a data preparation task was employed to clean the data and make it ready for the deep learning pipeline. The challenges and processes that were applied are stated below:\nMessy text messages: The tweets had urls, punctuations, username mentions, hastags, emoticons, numbers and lots of special characters. These were all cleaned up in a preprocessing cycle to clean the data.\nStop words: Stop words corpus obtained from NLTK was used to eliminate most unproductive words which provide little information about individual tweets.\nTransliteration: Followed by above two processes, we translated Hinglish tweets into English words using a two phase process\nTransliteration: In phase I, we used translation API's provided by Google translation services and exposed via a SDK, to transliteration the Hinglish messages to English messages.\nTranslation: After transliteration, words that were specific to Hinglish were translated to English using an Hinglish-English dictionary. By doing this we converted the Hinglish message to and assortment of isolated words being presented in the message in a sequence that can also be represented using word to vector representation.\nData augmentation: Given the data set was very small with a high degree of imbalance in the labelled messages for three different classes, we employed a data augmentation technique to boost the learning of the deep network. Following techniques from the paper by Jason et al. was utilized in this setting that really helped during the training phase.Thsi techniques wasnt used in previous studies. The techniques were:\nSynonym Replacement (SR):Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.\nRandom Insertion (RI):Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.\nRandom Swap (RS):Randomly choose two words in the sentence and swap their positions. Do this n times.\nRandom Deletion (RD):For each word in the sentence, randomly remove it with probability p.\nWord Representation: We used word embedding representations by Glove for creating word embedding layers and to obtain the word sequence vector representations of the processed tweets. The pre-trained embedding dimension were one of the hyperparamaters for model. Further more, we introduced another bit flag hyperparameter that determined if to freeze these learnt embedding.\nTrain-test split: The labelled dataset that was available for this task was very limited in number of examples and thus as noted above few data augmentation techniques were applied to boost the learning of the network. Before applying augmentation, a train-test split of 78%-22% was done from the original, cleansed data set. Thus, 700 tweets/messages were held out for testing. All model evaluation were done in on the test set that got generated by this process. The results presented in this report are based on the performance of the model on the test set. The training set of 2489 messages were however sent to an offline pipeline for augmenting the data. The resulting training dataset was thus 7934 messages. the final distribution of messages for training and test was thus below:\nModel Architecture\nWe tested the performance of various model architectures by running our experiment over 100 times on a CPU based compute which later as migrated to GPU based compute to overcome the slow learning progress. Our universal metric for minimizing was the validation loss and we employed various operational techniques for optimizing on the learning process. These processes and its implementation details will be discussed later but they were learning rate decay, early stopping, model checkpointing and reducing learning rate on plateau.\nModel Architecture ::: Loss function\nFor the loss function we chose categorical cross entropy loss in finding the most optimal weights/parameters of the model. Formally this loss function for the model is defined as below:\nThe double sum is over the number of observations and the categories respectively. While the model probability is the probability that the observation i belongs to category c.\nModel Architecture ::: Models\nAmong the model architectures we experimented with and without data augmentation were:\nFully Connected dense networks: Model hyperparameters were inspired from the previous work done by Vo et al and Mathur et al. This was also used as a baseline model but we did not get appreciable performance on such architecture due to FC networks not being able to capture local and long term dependencies.\nConvolution based architectures: Architecture and hyperparameter choices were chosen from the past study Deon on the subject. We were able to boost the performance as compared to only FC based network but we noticed better performance from architectures that are suitable to sequences such as text messages or any timeseries data.\nSequence models: We used SimpleRNN, LSTM, GRU, Bidirectional LSTM model architecture to capture long term dependencies of the messages in determining the class the message or the tweet belonged to.\nBased on all the experiments we conducted below model had best performance related to metrics - Recall rate, F1 score and Overall accuracy.\nModel Architecture ::: Hyper parameters\nChoice of model parameters were in the above models were inspired from previous work done but then were tuned to the best performance of the Test dataset. Following parameters were considered for tuning.\nLearning rate: Based on grid search the best performance was achieved when learning rate was set to 0.01. This value was arrived by a grid search on lr parameter.\nNumber of Bidirectional LSTM units: A set of 32, 64, 128 hidden activation units were considered for tuning the model. 128 was a choice made by Vo et al in modeling for Vietnamese language but with our experiments and with a small dataset to avoid overfitting to train dataset, a smaller unit sizes were considered.\nEmbedding dimension: 50, 100 and 200 dimension word representation from Glove word embedding were considered and the best results were obtained with 100d representation, consistent with choices made in the previous work.\nTransfer learning on Embedding; Another bit flag for training the embedding on the train data or freezing the embedding from Glove was used. It was determined that set of pre-trained weights from Glove was best when it was fine tuned with Hinglish data. It provides evidence that a separate word or sentence level embedding when learnt for Hinglish text analysis will be very useful.\nNumber of dense FC layers.\nMaximum length of the sequence to be considered: The max length of tweets/message in the dataset was 1265 while average was 116. We determined that choosing 200 resulted in the best performance.\nResults\nDuring our experimentation, it was evident that this is a hard problem especially detecting the hate speech, text in a code- mixed language. The best recall rate of 77 % for hate speech was obtained by a Bidirectional LSTM with 32 units with a recurrent drop out rate of 0.2. Precision wise GRU type of RNN sequence model faired better than other kinds for hate speech detection. On the other hand for detecting offensive and non offensive tweets, fairly satisfactory results were obtained. For offensive tweets, 92 % precision was and recall rate of 88% was obtained with GRU versus BiLSTM based models. Comparatively, Recall of 85 % and precision of 76 % was obtained by again GRU and BiLSTM based models as shown and marked in the results.\nConclusion and Future work\nThe results of the experiments are encouraging on detective offensive vs non offensive tweets and messages written in Hinglish in social media. The utilization of data augmentation technique in this classification task was one of the vital contributions which led us to surpass results obtained by previous state of the art Hybrid CNN-LSTM based models. However, the results of the model for predicting hateful tweets on the contrary brings forth some shortcomings of the model. The biggest shortcoming on the model based on error analysis indicates less than generalized examples presented by the dataset. We also note that the embedding learnt from the Hinglish data set may be lacking and require extensive training to have competent word representations of Hinglish text. Given this learning's, we identify that creating word embeddings on much larger Hinglish corpora may have significant results. We also hypothesize that considering alternate methods than translation and transliteration may prove beneficial.\nReferences\n[1] Mathur, Puneet and Sawhney, Ramit and Ayyar, Meghna and Shah, Rajiv, Did you offend me? classification of offensive tweets in hinglish language, Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)\n[2] Mathur, Puneet and Shah, Rajiv and Sawhney, Ramit and Mahata, Debanjan Detecting offensive tweets in hindi-english code-switched language Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media\n[3] Vo, Quan-Hoang and Nguyen, Huy-Tien and Le, Bac and Nguyen, Minh-Le Multi-channel LSTM-CNN model for Vietnamese sentiment analysis 2017 9th international conference on knowledge and systems engineering (KSE)\n[4] Hochreiter, Sepp and Schmidhuber, J\u00fcrgen Long short-term memory Neural computation 1997\n[5] Sinha, R Mahesh K and Thakur, Anil Multi-channel LSTM-CNN model for Vietnamese sentiment analysis 2017 9th international conference on knowledge and systems engineering (KSE)\n[6] Pennington, Jeffrey and Socher, Richard and Manning, Christopher Glove: Global vectors for word representation Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)\n[7] Zhang, Lei and Wang, Shuai and Liu, Bing Deep learning for sentiment analysis: A survey Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery\n[8] Caruana, Rich and Lawrence, Steve and Giles, C Lee Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping Advances in neural information processing systems\n[9] Beale, Mark Hudson and Hagan, Martin T and Demuth, Howard B Neural network toolbox user\u2019s guide The MathWorks Incs\n[10] Chollet, Fran\u00e7ois and others Keras: The python deep learning library Astrophysics Source Code Library\n[11] Wei, Jason and Zou, Kai EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)", "source": "scientific_qa", "evaluation": "f1"}
{"instructions": ["What additional techniques are incorporated?", "What dataset do they use?", "Do they compare to other models?", "What is the architecture of the system?", "What additional techniques could be incorporated to further improve accuracy?", "What programming language is target language?", "What dataset is used to measure accuracy?"], "outputs": ["incorporating coding syntax tree model", "A parallel corpus where the source is an English expression of code and the target is Python code.", "No", "seq2seq translation", "phrase-based word embedding, Abstract Syntax Tree(AST)", "Python", "validation data"], "input": "Introduction\nRemoving computer-human language barrier is an inevitable advancement researchers are thriving to achieve for decades. One of the stages of this advancement will be coding through natural human language instead of traditional programming language. On naturalness of computer programming D. Knuth said, \u201cLet us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.\u201dBIBREF0. Unfortunately, learning programming language is still necessary to instruct it. Researchers and developers are working to overcome this human-machine language barrier. Multiple branches exists to solve this challenge (i.e. inter-conversion of different programming language to have universally connected programming languages). Automatic code generation through natural language is not a new concept in computer science studies. However, it is difficult to create such tool due to these following three reasons\u2013\nProgramming languages are diverse\nAn individual person expresses logical statements differently than other\nNatural Language Processing (NLP) of programming statements is challenging since both human and programming language evolve over time\nIn this paper, a neural approach to translate pseudo-code or algorithm like human language expression into programming language code is proposed.\nProblem Description\nCode repositories (i.e. Git, SVN) flourished in the last decade producing big data of code allowing data scientists to perform machine learning on these data. In 2017, Allamanis M et al. published a survey in which they presented the state-of-the-art of the research areas where machine learning is changing the way programmers code during software engineering and development process BIBREF1. This paper discusses what are the restricting factors of developing such text-to-code conversion method and what problems need to be solved\u2013\nProblem Description ::: Programming Language Diversity\nAccording to the sources, there are more than a thousand actively maintained programming languages, which signifies the diversity of these language . These languages were created to achieve different purpose and use different syntaxes. Low-level languages such as assembly languages are easier to express in human language because of the low or no abstraction at all whereas high-level, or Object-Oriented Programing (OOP) languages are more diversified in syntax and expression, which is challenging to bring into a unified human language structure. Nonetheless, portability and transparency between different programming languages also remains a challenge and an open research area. George D. et al. tried to overcome this problem through XML mapping BIBREF2. They tried to convert codes from C++ to Java using XML mapping as an intermediate language. However, the authors encountered challenges to support different features of both languages.\nProblem Description ::: Human Language Factor\nOne of the motivations behind this paper is - as long as it is about programming, there is a finite and small set of expression which is used in human vocabulary. For instance, programmers express a for-loop in a very few specific ways BIBREF3. Variable declaration and value assignment expressions are also limited in nature. Although all codes are executable, human representation through text may not due to the semantic brittleness of code. Since high-level languages have a wide range of syntax, programmers use different linguistic expressions to explain those. For instance, small changes like swapping function arguments can significantly change the meaning of the code. Hence the challenge remains in processing human language to understand it properly which brings us to the next problem-\nProblem Description ::: NLP of statements\nAlthough there is a finite set of expressions for each programming statements, it is a challenge to extract information from the statements of the code accurately. Semantic analysis of linguistic expression plays an important role in this information extraction. For instance, in case of a loop, what is the initial value? What is the step value? When will the loop terminate?\nMihalcea R. et al. has achieved a variable success rate of 70-80% in producing code just from the problem statement expressed in human natural language BIBREF3. They focused solely on the detection of step and loops in their research. Another research group from MIT, Lei et al. use a semantic learning model for text to detect the inputs. The model produces a parser in C++ which can successfully parse more than 70% of the textual description of input BIBREF4. The test dataset and model was initially tested and targeted against ACM-ICPC participants\u00ednputs which contains diverse and sometimes complex input instructions.\nA recent survey from Allamanis M. et al. presented the state-of-the-art on the area of naturalness of programming BIBREF1. A number of research works have been conducted on text-to-code or code-to-text area in recent years. In 2015, Oda et al. proposed a way to translate each line of Python code into natural language pseudocode using Statistical Machine Learning Technique (SMT) framework BIBREF5 was used. This translation framework was able to - it can successfully translate the code to natural language pseudo coded text in both English and Japanese. In the same year, Chris Q. et al. mapped natural language with simple if-this-then-that logical rules BIBREF6. Tihomir G. and Viktor K. developed an Integrated Development Environment (IDE) integrated code assistant tool anyCode for Java which can search, import and call function just by typing desired functionality through text BIBREF7. They have used model and mapping framework between function signatures and utilized resources like WordNet, Java Corpus, relational mapping to process text online and offline.\nRecently in 2017, P. Yin and G. Neubig proposed a semantic parser which generates code through its neural model BIBREF8. They formulated a grammatical model which works as a skeleton for neural network training. The grammatical rules are defined based on the various generalized structure of the statements in the programming language.\nProposed Methodology\nThe use of machine learning techniques such as SMT proved to be at most 75% successful in converting human text to executable code. BIBREF9. A programming language is just like a language with less vocabulary compared to a typical human language. For instance, the code vocabulary of the training dataset was 8814 (including variable, function, class names), whereas the English vocabulary to express the same code was 13659 in total. Here, programming language is considered just like another human language and widely used SMT techniques have been applied.\nProposed Methodology ::: Statistical Machine Translation\nSMT techniques are widely used in Natural Language Processing (NLP). SMT plays a significant role in translation from one language to another, especially in lexical and grammatical rule extraction. In SMT, bilingual grammatical structures are automatically formed by statistical approaches instead of explicitly providing a grammatical model. This reduces months and years of work which requires significant collaboration between bi-lingual linguistics. Here, a neural network based machine translation model is used to translate regular text into programming code.\nProposed Methodology ::: Statistical Machine Translation ::: Data Preparation\nSMT techniques require a parallel corpus in thr source and thr target language. A text-code parallel corpus similar to Fig. FIGREF12 is used in training. This parallel corpus has 18805 aligned data in it . In source data, the expression of each line code is written in the English language. In target data, the code is written in Python programming language.\nProposed Methodology ::: Statistical Machine Translation ::: Vocabulary Generation\nTo train the neural model, the texts should be converted to a computational entity. To do that, two separate vocabulary files are created - one for the source texts and another for the code. Vocabulary generation is done by tokenization of words. Afterwards, the words are put into their contextual vector space using the popular word2vec BIBREF10 method to make the words computational.\nProposed Methodology ::: Statistical Machine Translation ::: Neural Model Training\nIn order to train the translation model between text-to-code an open source Neural Machine Translation (NMT) - OpenNMT implementation is utilized BIBREF11. PyTorch is used as Neural Network coding framework. For training, three types of Recurrent Neural Network (RNN) layers are used \u2013 an encoder layer, a decoder layer and an output layer. These layers together form a LSTM model. LSTM is typically used in seq2seq translation.\nIn Fig. FIGREF13, the neural model architecture is demonstrated. The diagram shows how it takes the source and target text as input and uses it for training. Vector representation of tokenized source and target text are fed into the model. Each token of the source text is passed into an encoder cell. Target text tokens are passed into a decoder cell. Encoder cells are part of the encoder RNN layer and decoder cells are part of the decoder RNN layer. End of the input sequence is marked by a $<$eos$>$ token. Upon getting the $<$eos$>$ token, the final cell state of encoder layer initiate the output layer sequence. At each target cell state, attention is applied with the encoder RNN state and combined with the current hidden state to produce the prediction of next target token. This predictions are then fed back to the target RNN. Attention mechanism helps us to overcome the fixed length restriction of encoder-decoder sequence and allows us to process variable length between input and output sequence. Attention uses encoder state and pass it to the decoder cell to give particular attention to the start of an output layer sequence. The encoder uses an initial state to tell the decoder what it is supposed to generate. Effectively, the decoder learns to generate target tokens, conditioned on the input sequence. Sigmoidal optimization is used to optimize the prediction.\nResult Analysis\nTraining parallel corpus had 18805 lines of annotated code in it. The training model is executed several times with different training parameters. During the final training process, 500 validation data is used to generate the recurrent neural model, which is 3% of the training data. We run the training with epoch value of 10 with a batch size of 64. After finishing the training, the accuracy of the generated model using validation data from the source corpus was 74.40% (Fig. FIGREF17).\nAlthough the generated code is incoherent and often predict wrong code token, this is expected because of the limited amount of training data. LSTM generally requires a more extensive set of data (100k+ in such scenario) to build a more accurate model. The incoherence can be resolved by incorporating coding syntax tree model in future. For instance\u2013\n\"define the method tzname with 2 arguments: self and dt.\"\nis translated into\u2013\ndef __init__ ( self , regex ) :.\nThe translator is successfully generating the whole codeline automatically but missing the noun part (parameter and function name) part of the syntax.\nConclusion & Future Works\nThe main advantage of translating to a programming language is - it has a concrete and strict lexical and grammatical structure which human languages lack. The aim of this paper was to make the text-to-code framework work for general purpose programming language, primarily Python. In later phase, phrase-based word embedding can be incorporated for improved vocabulary mapping. To get more accurate target code for each line, Abstract Syntax Tree(AST) can be beneficial.\nThe contribution of this research is a machine learning model which can turn the human expression into coding expressions. This paper also discusses available methods which convert natural language to programming language successfully in fixed or tightly bounded linguistic paradigm. Approaching this problem using machine learning will give us the opportunity to explore the possibility of unified programming interface as well in the future.\nAcknowledgment\nWe would like to thank Dr. Khandaker Tabin Hasan, Head of the Depertment of Computer Science, American International University-Bangladesh for his inspiration and encouragement in all of our research works. Also, thanks to Future Technology Conference - 2019 committee for partially supporting us to join the conference and one of our colleague - Faheem Abrar, Software Developer for his thorough review and comments on this research work and supporting us by providing fund.", "source": "scientific_qa", "evaluation": "f1"}
{"instructions": ["What corpus was the source of the OpenIE extractions?", "What is the accuracy of the proposed technique?", "Is an entity linking process used?", "Are the OpenIE extractions all triples?", "What method was used to generate the OpenIE extractions?", "Can the method answer multi-hop questions?", "What was the textual source to which OpenIE was applied?", "What OpenIE method was used to generate the extractions?", "Is their method capable of multi-hop reasoning?"], "outputs": ["domain-targeted $~$ 80K sentences and 280 GB of plain text extracted from web pages used by BIBREF6 aristo2016:combining", "51.7 and 51.6 on 4th and 8th grade question sets with no curated knowledge. 47.5 and 48.0 on 4th and 8th grade question sets when both solvers are given the same knowledge", "No", "No", "for each multiple-choice question $(q,A) \\in Q_\\mathit {tr}$ and each choice $a \\in A$ , we use all non-stopword tokens in $q$ and $a$ as an ElasticSearch query against S, take the top 200 hits, run Open IE v4, and aggregate the resulting tuples over all $a \\in A$ and over all questions in $Q_\\mathit {tr}$", "Yes", "domain-targeted $~$ 80K sentences and 280 GB of plain text extracted from web pages used by BIBREF6 aristo2016:combining", "for each multiple-choice question $(q,A) \\in Q_\\mathit {tr}$ and each choice $a \\in A$ , we use all non-stopword tokens in $q$ and $a$ as an ElasticSearch query against S, take the top 200 hits, run Open IE v4, and aggregate the resulting tuples over all $a \\in A$ and over all questions in $Q_\\mathit {tr}$", "Yes"], "input": "Introduction\nEffective question answering (QA) systems have been a long-standing quest of AI research. Structured curated KBs have been used successfully for this task BIBREF0 , BIBREF1 . However, these KBs are expensive to build and typically domain-specific. Automatically constructed open vocabulary (subject; predicate; object) style tuples have broader coverage, but have only been used for simple questions where a single tuple suffices BIBREF2 , BIBREF3 .\nOur goal in this work is to develop a QA system that can perform reasoning with Open IE BIBREF4 tuples for complex multiple-choice questions that require tuples from multiple sentences. Such a system can answer complex questions in resource-poor domains where curated knowledge is unavailable. Elementary-level science exams is one such domain, requiring complex reasoning BIBREF5 . Due to the lack of a large-scale structured KB, state-of-the-art systems for this task either rely on shallow reasoning with large text corpora BIBREF6 , BIBREF7 or deeper, structured reasoning with a small amount of automatically acquired BIBREF8 or manually curated BIBREF9 knowledge.\nConsider the following question from an Alaska state 4th grade science test:\nWhich object in our solar system reflects light and is a satellite that orbits around one planet? (A) Earth (B) Mercury (C) the Sun (D) the Moon\nThis question is challenging for QA systems because of its complex structure and the need for multi-fact reasoning. A natural way to answer it is by combining facts such as (Moon; is; in the solar system), (Moon; reflects; light), (Moon; is; satellite), and (Moon; orbits; around one planet).\nA candidate system for such reasoning, and which we draw inspiration from, is the TableILP system of BIBREF9 . TableILP treats QA as a search for an optimal subgraph that connects terms in the question and answer via rows in a set of curated tables, and solves the optimization problem using Integer Linear Programming (ILP). We similarly want to search for an optimal subgraph. However, a large, automatically extracted tuple KB makes the reasoning context different on three fronts: (a) unlike reasoning with tables, chaining tuples is less important and reliable as join rules aren't available; (b) conjunctive evidence becomes paramount, as, unlike a long table row, a single tuple is less likely to cover the entire question; and (c) again, unlike table rows, tuples are noisy, making combining redundant evidence essential. Consequently, a table-knowledge centered inference model isn't the best fit for noisy tuples.\nTo address this challenge, we present a new ILP-based model of inference with tuples, implemented in a reasoner called TupleInf. We demonstrate that TupleInf significantly outperforms TableILP by 11.8% on a broad set of over 1,300 science questions, without requiring manually curated tables, using a substantially simpler ILP formulation, and generalizing well to higher grade levels. The gains persist even when both solvers are provided identical knowledge. This demonstrates for the first time how Open IE based QA can be extended from simple lookup questions to an effective system for complex questions.\nRelated Work\nWe discuss two classes of related work: retrieval-based web question-answering (simple reasoning with large scale KB) and science question-answering (complex reasoning with small KB).\nTuple Inference Solver\nWe first describe the tuples used by our solver. We define a tuple as (subject; predicate; objects) with zero or more objects. We refer to the subject, predicate, and objects as the fields of the tuple.\nTuple KB\nWe use the text corpora (S) from BIBREF6 aristo2016:combining to build our tuple KB. For each test set, we use the corresponding training questions $Q_\\mathit {tr}$ to retrieve domain-relevant sentences from S. Specifically, for each multiple-choice question $(q,A) \\in Q_\\mathit {tr}$ and each choice $a \\in A$ , we use all non-stopword tokens in $q$ and $a$ as an ElasticSearch query against S. We take the top 200 hits, run Open IE v4, and aggregate the resulting tuples over all $a \\in A$ and over all questions in $Q_\\mathit {tr}$ to create the tuple KB (T).\nTuple Selection\nGiven a multiple-choice question $qa$ with question text $q$ and answer choices A= $\\lbrace a_i\\rbrace $ , we select the most relevant tuples from $T$ and $S$ as follows.\nSelecting from Tuple KB: We use an inverted index to find the 1,000 tuples that have the most overlapping tokens with question tokens $tok(qa).$ . We also filter out any tuples that overlap only with $tok(q)$ as they do not support any answer. We compute the normalized TF-IDF score treating the question, $q$ as a query and each tuple, $t$ as a document: $ &\\textit {tf}(x, q)=1\\; \\textmd {if x} \\in q ; \\textit {idf}(x) = log(1 + N/n_x) \\\\ &\\textit {tf-idf}(t, q)=\\sum _{x \\in t\\cap q} idf(x) $\nwhere $N$ is the number of tuples in the KB and $n_x$ are the number of tuples containing $x$ . We normalize the tf-idf score by the number of tokens in $t$ and $q$ . We finally take the 50 top-scoring tuples $T_{qa}$ .\nOn-the-fly tuples from text: To handle questions from new domains not covered by the training set, we extract additional tuples on the fly from S (similar to BIBREF17 knowlhunting). We perform the same ElasticSearch query described earlier for building T. We ignore sentences that cover none or all answer choices as they are not discriminative. We also ignore long sentences ( $>$ 300 characters) and sentences with negation as they tend to lead to noisy inference. We then run Open IE on these sentences and re-score the resulting tuples using the Jaccard score due to the lossy nature of Open IE, and finally take the 50 top-scoring tuples $T^{\\prime }_{qa}$ .\nSupport Graph Search\nSimilar to TableILP, we view the QA task as searching for a graph that best connects the terms in the question (qterms) with an answer choice via the knowledge; see Figure 1 for a simple illustrative example. Unlike standard alignment models used for tasks such as Recognizing Textual Entailment (RTE) BIBREF18 , however, we must score alignments between a set $T_{qa} \\cup T^{\\prime }_{qa}$ of structured tuples and a (potentially multi-sentence) multiple-choice question $qa$ .\nThe qterms, answer choices, and tuples fields form the set of possible vertices, $\\mathcal {V}$ , of the support graph. Edges connecting qterms to tuple fields and tuple fields to answer choices form the set of possible edges, $\\mathcal {E}$ . The support graph, $G(V, E)$ , is a subgraph of $\\mathcal {G}(\\mathcal {V}, \\mathcal {E})$ where $V$ and $E$ denote \u201cactive\u201d nodes and edges, resp. We define the desired behavior of an optimal support graph via an ILP model as follows.\nSimilar to TableILP, we score the support graph based on the weight of the active nodes and edges. Each edge $e(t, h)$ is weighted based on a word-overlap score. While TableILP used WordNet BIBREF19 paths to compute the weight, this measure results in unreliable scores when faced with longer phrases found in Open IE tuples.\nCompared to a curated KB, it is easy to find Open IE tuples that match irrelevant parts of the questions. To mitigate this issue, we improve the scoring of qterms in our ILP objective to focus on important terms. Since the later terms in a question tend to provide the most critical information, we scale qterm coefficients based on their position. Also, qterms that appear in almost all of the selected tuples tend not to be discriminative as any tuple would support such a qterm. Hence we scale the coefficients by the inverse frequency of the tokens in the selected tuples.\nSince Open IE tuples do not come with schema and join rules, we can define a substantially simpler model compared to TableILP. This reduces the reasoning capability but also eliminates the reliance on hand-authored join rules and regular expressions used in TableILP. We discovered (see empirical evaluation) that this simple model can achieve the same score as TableILP on the Regents test (target test set used by TableILP) and generalizes better to different grade levels.\nWe define active vertices and edges using ILP constraints: an active edge must connect two active vertices and an active vertex must have at least one active edge. To avoid positive edge coefficients in the objective function resulting in spurious edges in the support graph, we limit the number of active edges from an active tuple, question choice, tuple fields, and qterms (first group of constraints in Table 1 ). Our model is also capable of using multiple tuples to support different parts of the question as illustrated in Figure 1 . To avoid spurious tuples that only connect with the question (or choice) or ignore the relation being expressed in the tuple, we add constraints that require each tuple to connect a qterm with an answer choice (second group of constraints in Table 1 ).\nWe also define new constraints based on the Open IE tuple structure. Since an Open IE tuple expresses a fact about the tuple's subject, we require the subject to be active in the support graph. To avoid issues such as (Planet; orbit; Sun) matching the sample question in the introduction (\u201cWhich object $\\ldots $ orbits around a planet\u201d), we also add an ordering constraint (third group in Table 1 ).\nIts worth mentioning that TupleInf only combines parallel evidence i.e. each tuple must connect words in the question to the answer choice. For reliable multi-hop reasoning using OpenIE tuples, we can add inter-tuple connections to the support graph search, controlled by a small number of rules over the OpenIE predicates. Learning such rules for the Science domain is an open problem and potential avenue of future work.\nExperiments\nComparing our method with two state-of-the-art systems for 4th and 8th grade science exams, we demonstrate that (a) TupleInf with only automatically extracted tuples significantly outperforms TableILP with its original curated knowledge as well as with additional tuples, and (b) TupleInf's complementary approach to IR leads to an improved ensemble. Numbers in bold indicate statistical significance based on the Binomial exact test BIBREF20 at $p=0.05$ .\nWe consider two question sets. (1) 4th Grade set (1220 train, 1304 test) is a 10x larger superset of the NY Regents questions BIBREF6 , and includes professionally written licensed questions. (2) 8th Grade set (293 train, 282 test) contains 8th grade questions from various states.\nWe consider two knowledge sources. The Sentence corpus (S) consists of domain-targeted $~$ 80K sentences and 280 GB of plain text extracted from web pages used by BIBREF6 aristo2016:combining. This corpus is used by the IR solver and also used to create the tuple KB T and on-the-fly tuples $T^{\\prime }_{qa}$ . Additionally, TableILP uses $\\sim $ 70 Curated tables (C) designed for 4th grade NY Regents exams.\nWe compare TupleInf with two state-of-the-art baselines. IR is a simple yet powerful information-retrieval baseline BIBREF6 that selects the answer option with the best matching sentence in a corpus. TableILP is the state-of-the-art structured inference baseline BIBREF9 developed for science questions.\nResults\nTable 2 shows that TupleInf, with no curated knowledge, outperforms TableILP on both question sets by more than 11%. The lower half of the table shows that even when both solvers are given the same knowledge (C+T), the improved selection and simplified model of TupleInf results in a statistically significant improvement. Our simple model, TupleInf(C + T), also achieves scores comparable to TableILP on the latter's target Regents questions (61.4% vs TableILP's reported 61.5%) without any specialized rules.\nTable 3 shows that while TupleInf achieves similar scores as the IR solver, the approaches are complementary (structured lossy knowledge reasoning vs. lossless sentence retrieval). The two solvers, in fact, differ on 47.3% of the training questions. To exploit this complementarity, we train an ensemble system BIBREF6 which, as shown in the table, provides a substantial boost over the individual solvers. Further, IR + TupleInf is consistently better than IR + TableILP. Finally, in combination with IR and the statistical association based PMI solver (that scores 54.1% by itself) of BIBREF6 aristo2016:combining, TupleInf achieves a score of 58.2% as compared to TableILP's ensemble score of 56.7% on the 4th grade set, again attesting to TupleInf's strength.\nError Analysis\nWe describe four classes of failures that we observed, and the future work they suggest.\nMissing Important Words: Which material will spread out to completely fill a larger container? (A)air (B)ice (C)sand (D)water\nIn this question, we have tuples that support water will spread out and fill a larger container but miss the critical word \u201ccompletely\u201d. An approach capable of detecting salient question words could help avoid that.\nLossy IE: Which action is the best method to separate a mixture of salt and water? ...\nThe IR solver correctly answers this question by using the sentence: Separate the salt and water mixture by evaporating the water. However, TupleInf is not able to answer this question as Open IE is unable to extract tuples from this imperative sentence. While the additional structure from Open IE is useful for more robust matching, converting sentences to Open IE tuples may lose important bits of information.\nBad Alignment: Which of the following gases is necessary for humans to breathe in order to live?(A) Oxygen(B) Carbon dioxide(C) Helium(D) Water vapor\nTupleInf returns \u201cCarbon dioxide\u201d as the answer because of the tuple (humans; breathe out; carbon dioxide). The chunk \u201cto breathe\u201d in the question has a high alignment score to the \u201cbreathe out\u201d relation in the tuple even though they have completely different meanings. Improving the phrase alignment can mitigate this issue.\nOut of scope: Deer live in forest for shelter. If the forest was cut down, which situation would most likely happen?...\nSuch questions that require modeling a state presented in the question and reasoning over the state are out of scope of our solver.\nConclusion\nWe presented a new QA system, TupleInf, that can reason over a large, potentially noisy tuple KB to answer complex questions. Our results show that TupleInf is a new state-of-the-art structured solver for elementary-level science that does not rely on curated knowledge and generalizes to higher grades. Errors due to lossy IE and misalignments suggest future work in incorporating context and distributional measures.\nAppendix: ILP Model Details\nTo build the ILP model, we first need to get the questions terms (qterm) from the question by chunking the question using an in-house chunker based on the postagger from FACTORIE.\nExperiment Details\nWe use the SCIP ILP optimization engine BIBREF21 to optimize our ILP model. To get the score for each answer choice $a_i$ , we force the active variable for that choice $x_{a_i}$ to be one and use the objective function value of the ILP model as the score. For evaluations, we use a 2-core 2.5 GHz Amazon EC2 linux machine with 16 GB RAM. To evaluate TableILP and TupleInf on curated tables and tuples, we converted them into the expected format of each solver as follows.\nUsing curated tables with TupleInf\nFor each question, we select the 7 best matching tables using the tf-idf score of the table w.r.t. the question tokens and top 20 rows from each table using the Jaccard similarity of the row with the question. (same as BIBREF9 tableilp2016). We then convert the table rows into the tuple structure using the relations defined by TableILP. For every pair of cells connected by a relation, we create a tuple with the two cells as the subject and primary object with the relation as the predicate. The other cells of the table are used as additional objects to provide context to the solver. We pick top-scoring 50 tuples using the Jaccard score.\nUsing Open IE tuples with TableILP\nWe create an additional table in TableILP with all the tuples in $T$ . Since TableILP uses fixed-length $(subject; predicate; object)$ triples, we need to map tuples with multiple objects to this format. For each object, $O_i$ in the input Open IE tuple $(S; P; O_1; O_2 \\ldots )$ , we add a triple $(S; P; O_i)$ to this table.", "source": "scientific_qa", "evaluation": "f1"}
{"instructions": ["What models do they propose?", "Are all tweets in English?", "How large is the dataset?", "What is the results of multimodal compared to unimodal models?", "What is author's opinion on why current multimodal models cannot outperform models analyzing only text?", "What metrics are used to benchmark the results?", "How is data collected, manual collection or Twitter api?", "How many tweats does MMHS150k contains, 150000?", "What unimodal detection models were used?", "What different models for multimodal detection were proposed?", "What annotations are available in the dataset - tweat used hate speach or not?"], "outputs": ["Feature Concatenation Model (FCM), Spatial Concatenation Model (SCM), Textual Kernels Model (TKM)", "Unanswerable", " $150,000$ tweets", "Unimodal LSTM vs Best Multimodal (FCM)\n- F score: 0.703 vs 0.704\n- AUC: 0.732 vs 0.734 \n- Mean Accuracy: 68.3 vs 68.4 ", "Noisy data, Complexity and diversity of multimodal relations, Small set of multimodal examples", "F-score, Area Under the ROC Curve (AUC), mean accuracy (ACC), Precision vs Recall plot, ROC curve (which plots the True Positive Rate vs the False Positive Rate)", "Twitter API", "$150,000$ tweets", " single layer LSTM with a 150-dimensional hidden state for hate / not hate classification", "Feature Concatenation Model (FCM), Spatial Concatenation Model (SCM), Textual Kernels Model (TKM)", "No attacks to any community,  racist, sexist, homophobic, religion based attacks, attacks to other communities"], "input": "Introduction\nSocial Media platforms such as Facebook, Twitter or Reddit have empowered individuals' voices and facilitated freedom of expression. However they have also been a breeding ground for hate speech and other types of online harassment. Hate speech is defined in legal literature as speech (or any form of expression) that expresses (or seeks to promote, or has the capacity to increase) hatred against a person or a group of people because of a characteristic they share, or a group to which they belong BIBREF0. Twitter develops this definition in its hateful conduct policy as violence against or directly attack or threaten other people on the basis of race, ethnicity, national origin, sexual orientation, gender, gender identity, religious affiliation, age, disability, or serious disease.\nIn this work we focus on hate speech detection. Due to the inherent complexity of this task, it is important to distinguish hate speech from other types of online harassment. In particular, although it might be offensive to many people, the sole presence of insulting terms does not itself signify or convey hate speech. And, the other way around, hate speech may denigrate or threaten an individual or a group of people without the use of any profanities. People from the african-american community, for example, often use the term nigga online, in everyday language, without malicious intentions to refer to folks within their community, and the word cunt is often used in non hate speech publications and without any sexist purpose. The goal of this work is not to discuss if racial slur, such as nigga, should be pursued. The goal is to distinguish between publications using offensive terms and publications attacking communities, which we call hate speech.\nModern social media content usually include images and text. Some of these multimodal publications are only hate speech because of the combination of the text with a certain image. That is because, as we have stated, the presence of offensive terms does not itself signify hate speech, and the presence of hate speech is often determined by the context of a publication. Moreover, users authoring hate speech tend to intentionally construct publications where the text is not enough to determine they are hate speech. This happens especially in Twitter, where multimodal tweets are formed by an image and a short text, which in many cases is not enough to judge them. In those cases, the image might give extra context to make a proper judgement. Fig. FIGREF5 shows some of such examples in MMHS150K.\nThe contributions of this work are as follows:\n[noitemsep,leftmargin=*]\nWe propose the novel task of hate speech detection in multimodal publications, collect, annotate and publish a large scale dataset.\nWe evaluate state of the art multimodal models on this specific task and compare their performance with unimodal detection. Even though images are proved to be useful for hate speech detection, the proposed multimodal models do not outperform unimodal textual models.\nWe study the challenges of the proposed task, and open the field for future research.\nRelated Work ::: Hate Speech Detection\nThe literature on detecting hate speech on online textual publications is extensive. Schmidt and Wiegand BIBREF1 recently provided a good survey of it, where they review the terminology used over time, the features used, the existing datasets and the different approaches. However, the field lacks a consistent dataset and evaluation protocol to compare proposed methods. Saleem et al. BIBREF2 compare different classification methods detecting hate speech in Reddit and other forums. Wassem and Hovy BIBREF3 worked on hate speech detection on twitter, published a manually annotated dataset and studied its hate distribution. Later Wassem BIBREF4 extended the previous published dataset and compared amateur and expert annotations, concluding that amateur annotators are more likely than expert annotators to label items as hate speech. Park and Fung BIBREF5 worked on Wassem datasets and proposed a classification method using a CNN over Word2Vec BIBREF6 word embeddings, showing also classification results on racism and sexism hate sub-classes. Davidson et al. BIBREF7 also worked on hate speech detection on twitter, publishing another manually annotated dataset. They test different classifiers such as SVMs and decision trees and provide a performance comparison. Malmasi and Zampieri BIBREF8 worked on Davidson's dataset improving his results using more elaborated features. ElSherief et al. BIBREF9 studied hate speech on twitter and selected the most frequent terms in hate tweets based on Hatebase, a hate expression repository. They propose a big hate dataset but it lacks manual annotations, and all the tweets containing certain hate expressions are considered hate speech. Zhang et al. BIBREF10 recently proposed a more sophisticated approach for hate speech detection, using a CNN and a GRU BIBREF11 over Word2Vec BIBREF6 word embeddings. They show experiments in different datasets outperforming previous methods. Next, we summarize existing hate speech datasets:\n[noitemsep,leftmargin=*]\nRM BIBREF10: Formed by $2,435$ tweets discussing Refugees and Muslims, annotated as hate or non-hate.\nDT BIBREF7: Formed by $24,783$ tweets annotated as hate, offensive language or neither. In our work, offensive language tweets are considered as non-hate.\nWZ-LS BIBREF5: A combination of Wassem datasets BIBREF4, BIBREF3 labeled as racism, sexism, neither or both that make a total of $18,624$ tweets.\nSemi-Supervised BIBREF9: Contains $27,330$ general hate speech Twitter tweets crawled in a semi-supervised manner.\nAlthough often modern social media publications include images, not too many contributions exist that exploit visual information. Zhong et al. BIBREF12 worked on classifying Instagram images as potential cyberbullying targets, exploiting both the image content, the image caption and the comments. However, their visual information processing is limited to the use of features extracted by a pre-trained CNN, the use of which does not achieve any improvement. Hosseinmardi et al. BIBREF13 also address the problem of detecting cyberbullying incidents on Instagram exploiting both textual and image content. But, again, their visual information processing is limited to use the features of a pre-trained CNN, and the improvement when using visual features on cyberbullying classification is only of 0.01%.\nRelated Work ::: Visual and Textual Data Fusion\nA typical task in multimodal visual and textual analysis is to learn an alignment between feature spaces. To do that, usually a CNN and a RNN are trained jointly to learn a joint embedding space from aligned multimodal data. This approach is applied in tasks such as image captioning BIBREF14, BIBREF15 and multimodal image retrieval BIBREF16, BIBREF17. On the other hand, instead of explicitly learning an alignment between two spaces, the goal of Visual Question Answering (VQA) is to merge both data modalities in order to decide which answer is correct. This problem requires modeling very precise correlations between the image and the question representations. The VQA task requirements are similar to our hate speech detection problem in multimodal publications, where we have a visual and a textual input and we need to combine both sources of information to understand the global context and make a decision. We thus take inspiration from the VQA literature for the tested models. Early VQA methods BIBREF18 fuse textual and visual information by feature concatenation. Later methods, such as Multimodal Compact Bilinear pooling BIBREF19, utilize bilinear pooling to learn multimodal features. An important limitation of these methods is that the multimodal features are fused in the latter model stage, so the textual and visual relationships are modeled only in the last layers. Another limitation is that the visual features are obtained by representing the output of the CNN as a one dimensional vector, which losses the spatial information of the input images. In a recent work, Gao et al. BIBREF20 propose a feature fusion scheme to overcome these limitations. They learn convolution kernels from the textual information \u2013which they call question-guided kernels\u2013 and convolve them with the visual information in an earlier stage to get the multimodal features. Margffoy-Tuay et al. BIBREF21 use a similar approach to combine visual and textual information, but they address a different task: instance segmentation guided by natural language queries. We inspire in these latest feature fusion works to build the models for hate speech detection.\nThe MMHS150K dataset\nExisting hate speech datasets contain only textual data. Moreover, a reference benchmark does not exists. Most of the published datasets are crawled from Twitter and distributed as tweet IDs but, since Twitter removes reported user accounts, an important amount of their hate tweets is no longer accessible. We create a new manually annotated multimodal hate speech dataset formed by $150,000$ tweets, each one of them containing text and an image. We call the dataset MMHS150K, and made it available online . In this section, we explain the dataset creation steps.\nThe MMHS150K dataset ::: Tweets Gathering\nWe used the Twitter API to gather real-time tweets from September 2018 until February 2019, selecting the ones containing any of the 51 Hatebase terms that are more common in hate speech tweets, as studied in BIBREF9. We filtered out retweets, tweets containing less than three words and tweets containing porn related terms. From that selection, we kept the ones that included images and downloaded them. Twitter applies hate speech filters and other kinds of content control based on its policy, although the supervision is based on users' reports. Therefore, as we are gathering tweets from real-time posting, the content we get has not yet passed any filter.\nThe MMHS150K dataset ::: Textual Image Filtering\nWe aim to create a multimodal hate speech database where all the instances contain visual and textual information that we can later process to determine if a tweet is hate speech or not. But a considerable amount of the images of the selected tweets contain only textual information, such as screenshots of other tweets. To ensure that all the dataset instances contain both visual and textual information, we remove those tweets. To do that, we use TextFCN BIBREF22, BIBREF23 , a Fully Convolutional Network that produces a pixel wise text probability map of an image. We set empirical thresholds to discard images that have a substantial total text probability, filtering out $23\\%$ of the collected tweets.\nThe MMHS150K dataset ::: Annotation\nWe annotate the gathered tweets using the crowdsourcing platform Amazon Mechanical Turk. There, we give the workers the definition of hate speech and show some examples to make the task clearer. We then show the tweet text and image and we ask them to classify it in one of 6 categories: No attacks to any community, racist, sexist, homophobic, religion based attacks or attacks to other communities. Each one of the $150,000$ tweets is labeled by 3 different workers to palliate discrepancies among workers.\nWe received a lot of valuable feedback from the annotators. Most of them had understood the task correctly, but they were worried because of its subjectivity. This is indeed a subjective task, highly dependent on the annotator convictions and sensitivity. However, we expect to get cleaner annotations the more strong the attack is, which are the publications we are more interested on detecting. We also detected that several users annotate tweets for hate speech just by spotting slur. As already said previously, just the use of particular words can be offensive to many people, but this is not the task we aim to solve. We have not included in our experiments those hits that were made in less than 3 seconds, understanding that it takes more time to grasp the multimodal context and make a decision.\nWe do a majority voting between the three annotations to get the tweets category. At the end, we obtain $112,845$ not hate tweets and $36,978$ hate tweets. The latest are divided in $11,925$ racist, $3,495$ sexist, $3,870$ homophobic, 163 religion-based hate and $5,811$ other hate tweets (Fig. FIGREF17). In this work, we do not use hate sub-categories, and stick to the hate / not hate split. We separate balanced validation ($5,000$) and test ($10,000$) sets. The remaining tweets are used for training.\nWe also experimented using hate scores for each tweet computed given the different votes by the three annotators instead of binary labels. The results did not present significant differences to those shown in the experimental part of this work, but the raw annotations will be published nonetheless for further research.\nAs far as we know, this dataset is the biggest hate speech dataset to date, and the first multimodal hate speech dataset. One of its challenges is to distinguish between tweets using the same key offensive words that constitute or not an attack to a community (hate speech). Fig. FIGREF18 shows the percentage of hate and not hate tweets of the top keywords.\nMethodology ::: Unimodal Treatment ::: Images.\nAll images are resized such that their shortest size has 500 pixels. During training, online data augmentation is applied as random cropping of $299\\times 299$ patches and mirroring. We use a CNN as the image features extractor which is an Imagenet BIBREF24 pre-trained Google Inception v3 architecture BIBREF25. The fine-tuning process of the Inception v3 layers aims to modify its weights to extract the features that, combined with the textual information, are optimal for hate speech detection.\nMethodology ::: Unimodal Treatment ::: Tweet Text.\nWe train a single layer LSTM with a 150-dimensional hidden state for hate / not hate classification. The input dimensionality is set to 100 and GloVe BIBREF26 embeddings are used as word input representations. Since our dataset is not big enough to train a GloVe word embedding model, we used a pre-trained model that has been trained in two billion tweets. This ensures that the model will be able to produce word embeddings for slang and other words typically used in Twitter. To process the tweets text before generating the word embeddings, we use the same pipeline as the model authors, which includes generating symbols to encode Twitter special interactions such as user mentions (@user) or hashtags (#hashtag). To encode the tweet text and input it later to multimodal models, we use the LSTM hidden state after processing the last tweet word. Since the LSTM has been trained for hate speech classification, it extracts the most useful information for this task from the text, which is encoded in the hidden state after inputting the last tweet word.\nMethodology ::: Unimodal Treatment ::: Image Text.\nThe text in the image can also contain important information to decide if a publication is hate speech or not, so we extract it and also input it to our model. To do so, we use Google Vision API Text Detection module BIBREF27. We input the tweet text and the text from the image separately to the multimodal models, so it might learn different relations between them and between them and the image. For instance, the model could learn to relate the image text with the area in the image where the text appears, so it could learn to interpret the text in a different way depending on the location where it is written in the image. The image text is also encoded by the LSTM as the hidden state after processing its last word.\nMethodology ::: Multimodal Architectures\nThe objective of this work is to build a hate speech detector that leverages both textual and visual data and detects hate speech publications based on the context given by both data modalities. To study how the multimodal context can boost the performance compared to an unimodal context we evaluate different models: a Feature Concatenation Model (FCM), a Spatial Concatenation Model (SCM) and a Textual Kernels Model (TKM). All of them are CNN+RNN models with three inputs: the tweet image, the tweet text and the text appearing in the image (if any).\nMethodology ::: Multimodal Architectures ::: Feature Concatenation Model (FCM)\nThe image is fed to the Inception v3 architecture and the 2048 dimensional feature vector after the last average pooling layer is used as the visual representation. This vector is then concatenated with the 150 dimension vectors of the LSTM last word hidden states of the image text and the tweet text, resulting in a 2348 feature vector. This vector is then processed by three fully connected layers of decreasing dimensionality $(2348, 1024, 512)$ with following corresponding batch normalization and ReLu layers until the dimensions are reduced to two, the number of classes, in the last classification layer. The FCM architecture is illustrated in Fig. FIGREF26.\nMethodology ::: Multimodal Architectures ::: Spatial Concatenation Model (SCM)\nInstead of using the latest feature vector before classification of the Inception v3 as the visual representation, in the SCM we use the $8\\times 8\\times 2048$ feature map after the last Inception module. Then we concatenate the 150 dimension vectors encoding the tweet text and the tweet image text at each spatial location of that feature map. The resulting multimodal feature map is processed by two Inception-E blocks BIBREF28. After that, dropout and average pooling are applied and, as in the FCM model, three fully connected layers are used to reduce the dimensionality until the classification layer.\nMethodology ::: Multimodal Architectures ::: Textual Kernels Model (TKM)\nThe TKM design, inspired by BIBREF20 and BIBREF21, aims to capture interactions between the two modalities more expressively than concatenation models. As in SCM we use the $8\\times 8\\times 2048$ feature map after the last Inception module as the visual representation. From the 150 dimension vector encoding the tweet text, we learn $K_t$ text dependent kernels using independent fully connected layers that are trained together with the rest of the model. The resulting $K_t$ text dependent kernels will have dimensionality of $1\\times 1\\times 2048$. We do the same with the feature vector encoding the image text, learning $K_{it}$ kernels. The textual kernels are convolved with the visual feature map in the channel dimension at each spatial location, resulting in a $8\\times 8\\times (K_i+K_{it})$ multimodal feature map, and batch normalization is applied. Then, as in the SCM, the 150 dimension vectors encoding the tweet text and the tweet image text are concatenated at each spatial dimension. The rest of the architecture is the same as in SCM: two Inception-E blocks, dropout, average pooling and three fully connected layers until the classification layer. The number of tweet textual kernels $K_t$ and tweet image textual kernels $K_it$ is set to $K_t = 10$ and $K_it = 5$. The TKM architecture is illustrated in Fig. FIGREF29.\nMethodology ::: Multimodal Architectures ::: Training\nWe train the multimodal models with a Cross-Entropy loss with Softmax activations and an ADAM optimizer with an initial learning rate of $1e-4$. Our dataset suffers from a high class imbalance, so we weight the contribution to the loss of the samples to totally compensate for it. One of the goals of this work is to explore how every one of the inputs contributes to the classification and to prove that the proposed model can learn concurrences between visual and textual data useful to improve the hate speech classification results on multimodal data. To do that we train different models where all or only some inputs are available. When an input is not available, we set it to zeros, and we do the same when an image has no text.\nResults\nTable TABREF31 shows the F-score, the Area Under the ROC Curve (AUC) and the mean accuracy (ACC) of the proposed models when different inputs are available. $TT$ refers to the tweet text, $IT$ to the image text and $I$ to the image. It also shows results for the LSTM, for the Davison method proposed in BIBREF7 trained with MMHS150K, and for random scores. Fig. FIGREF32 shows the Precision vs Recall plot and the ROC curve (which plots the True Positive Rate vs the False Positive Rate) of the different models.\nFirst, notice that given the subjectivity of the task and the discrepancies between annotators, getting optimal scores in the evaluation metrics is virtually impossible. However, a system with relatively low metric scores can still be very useful for hate speech detection in a real application: it will fire on publications for which most annotators agree they are hate, which are often the stronger attacks. The proposed LSTM to detect hate speech when only text is available, gets similar results as the method presented in BIBREF7, which we trained with MMHS150K and the same splits. However, more than substantially advancing the state of the art on hate speech detection in textual publications, our key purpose in this work is to introduce and work on its detection on multimodal publications. We use LSTM because it provides a strong representation of the tweet texts.\nThe FCM trained only with images gets decent results, considering that in many publications the images might not give any useful information for the task. Fig. FIGREF33 shows some representative examples of the top hate and not hate scored images of this model. Many hate tweets are accompanied by demeaning nudity images, being sexist or homophobic. Other racist tweets are accompanied by images caricaturing black people. Finally, MEMES are also typically used in hate speech publications. The top scored images for not hate are portraits of people belonging to minorities. This is due to the use of slur inside these communities without an offensive intention, such as the word nigga inside the afro-american community or the word dyke inside the lesbian community. These results show that images can be effectively used to discriminate between offensive and non-offensive uses of those words.\nDespite the model trained only with images proves that they are useful for hate speech detection, the proposed multimodal models are not able to improve the detection compared to the textual models. Besides the different architectures, we have tried different training strategies, such as initializing the CNN weights with a model already trained solely with MMHS150K images or using dropout to force the multimodal models to use the visual information. Eventually, though, these models end up using almost only the text input for the prediction and producing very similar results to those of the textual models. The proposed multimodal models, such as TKM, have shown good performance in other tasks, such as VQA. Next, we analyze why they do not perform well in this task and with this data:\n[noitemsep,leftmargin=*]\nNoisy data. A major challenge of this task is the discrepancy between annotations due to subjective judgement. Although this affects also detection using only text, its repercussion is bigger in more complex tasks, such as detection using images or multimodal detection.\nComplexity and diversity of multimodal relations. Hate speech multimodal publications employ a lot of background knowledge which makes the relations between visual and textual elements they use very complex and diverse, and therefore difficult to learn by a neural network.\nSmall set of multimodal examples. Fig. FIGREF5 shows some of the challenging multimodal hate examples that we aimed to detect. But although we have collected a big dataset of $150K$ tweets, the subset of multimodal hate there is still too small to learn the complex multimodal relations needed to identify multimodal hate.\nConclusions\nIn this work we have explored the task of hate speech detection on multimodal publications. We have created MMHS150K, to our knowledge the biggest available hate speech dataset, and the first one composed of multimodal data, namely tweets formed by image and text. We have trained different textual, visual and multimodal models with that data, and found out that, despite the fact that images are useful for hate speech detection, the multimodal models do not outperform the textual models. Finally, we have analyzed the challenges of the proposed task and dataset. Given that most of the content in Social Media nowadays is multimodal, we truly believe on the importance of pushing forward this research. The code used in this work is available in .", "source": "scientific_qa", "evaluation": "f1"}
{"instructions": ["What topic is covered in the Chinese Facebook data? ", "How many layers does the UTCNN model have?", "What topics are included in the debate data?", "What is the size of the Chinese data?", "Did they collect the two datasets?", "What are the baselines?"], "outputs": ["anti-nuclear-power", "eight layers", "abortion, gay rights, Obama, marijuana", "32,595", "No", "SVM with unigram, bigram, trigram features, with average word embedding, with average transformed word embeddings, CNN and RCNN, SVM, CNN, RCNN with comment information"], "input": "Introduction\nThis work is licenced under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/ Deep neural networks have been widely used in text classification and have achieved promising results BIBREF0 , BIBREF1 , BIBREF2 . Most focus on content information and use models such as convolutional neural networks (CNN) BIBREF3 or recursive neural networks BIBREF4 . However, for user-generated posts on social media like Facebook or Twitter, there is more information that should not be ignored. On social media platforms, a user can act either as the author of a post or as a reader who expresses his or her comments about the post.\nIn this paper, we classify posts taking into account post authorship, likes, topics, and comments. In particular, users and their \u201clikes\u201d hold strong potential for text mining. For example, given a set of posts that are related to a specific topic, a user's likes and dislikes provide clues for stance labeling. From a user point of view, users with positive attitudes toward the issue leave positive comments on the posts with praise or even just the post's content; from a post point of view, positive posts attract users who hold positive stances. We also investigate the influence of topics: different topics are associated with different stance labeling tendencies and word usage. For example we discuss women's rights and unwanted babies on the topic of abortion, but we criticize medicine usage or crime when on the topic of marijuana BIBREF5 . Even for posts on a specific topic like nuclear power, a variety of arguments are raised: green energy, radiation, air pollution, and so on. As for comments, we treat them as additional text information. The arguments in the comments and the commenters (the users who leave the comments) provide hints on the post's content and further facilitate stance classification.\nIn this paper, we propose the user-topic-comment neural network (UTCNN), a deep learning model that utilizes user, topic, and comment information. We attempt to learn user and topic representations which encode user interactions and topic influences to further enhance text classification, and we also incorporate comment information. We evaluate this model on a post stance classification task on forum-style social media platforms. The contributions of this paper are as follows: 1. We propose UTCNN, a neural network for text in modern social media channels as well as legacy social media, forums, and message boards \u2014 anywhere that reveals users, their tastes, as well as their replies to posts. 2. When classifying social media post stances, we leverage users, including authors and likers. User embeddings can be generated even for users who have never posted anything. 3. We incorporate a topic model to automatically assign topics to each post in a single topic dataset. 4. We show that overall, the proposed method achieves the highest performance in all instances, and that all of the information extracted, whether users, topics, or comments, still has its contributions.\nExtra-Linguistic Features for Stance Classification\nIn this paper we aim to use text as well as other features to see how they complement each other in a deep learning model. In the stance classification domain, previous work has showed that text features are limited, suggesting that adding extra-linguistic constraints could improve performance BIBREF6 , BIBREF7 , BIBREF8 . For example, Hasan and Ng as well as Thomas et al. require that posts written by the same author have the same stance BIBREF9 , BIBREF10 . The addition of this constraint yields accuracy improvements of 1\u20137% for some models and datasets. Hasan and Ng later added user-interaction constraints and ideology constraints BIBREF7 : the former models the relationship among posts in a sequence of replies and the latter models inter-topic relationships, e.g., users who oppose abortion could be conservative and thus are likely to oppose gay rights.\nFor work focusing on online forum text, since posts are linked through user replies, sequential labeling methods have been used to model relationships between posts. For example, Hasan and Ng use hidden Markov models (HMMs) to model dependent relationships to the preceding post BIBREF9 ; Burfoot et al. use iterative classification to repeatedly generate new estimates based on the current state of knowledge BIBREF11 ; Sridhar et al. use probabilistic soft logic (PSL) to model reply links via collaborative filtering BIBREF12 . In the Facebook dataset we study, we use comments instead of reply links. However, as the ultimate goal in this paper is predicting not comment stance but post stance, we treat comments as extra information for use in predicting post stance.\nDeep Learning on Extra-Linguistic Features\nIn recent years neural network models have been applied to document sentiment classification BIBREF13 , BIBREF4 , BIBREF14 , BIBREF15 , BIBREF2 . Text features can be used in deep networks to capture text semantics or sentiment. For example, Dong et al. use an adaptive layer in a recursive neural network for target-dependent Twitter sentiment analysis, where targets are topics such as windows 7 or taylor swift BIBREF16 , BIBREF17 ; recursive neural tensor networks (RNTNs) utilize sentence parse trees to capture sentence-level sentiment for movie reviews BIBREF4 ; Le and Mikolov predict sentiment by using paragraph vectors to model each paragraph as a continuous representation BIBREF18 . They show that performance can thus be improved by more delicate text models.\nOthers have suggested using extra-linguistic features to improve the deep learning model. The user-word composition vector model (UWCVM) BIBREF19 is inspired by the possibility that the strength of sentiment words is user-specific; to capture this they add user embeddings in their model. In UPNN, a later extension, they further add a product-word composition as product embeddings, arguing that products can also show different tendencies of being rated or reviewed BIBREF20 . Their addition of user information yielded 2\u201310% improvements in accuracy as compared to the above-mentioned RNTN and paragraph vector methods. We also seek to inject user information into the neural network model. In comparison to the research of Tang et al. on sentiment classification for product reviews, the difference is two-fold. First, we take into account multiple users (one author and potentially many likers) for one post, whereas only one user (the reviewer) is involved in a review. Second, we add comment information to provide more features for post stance classification. None of these two factors have been considered previously in a deep learning model for text stance classification. Therefore, we propose UTCNN, which generates and utilizes user embeddings for all users \u2014 even for those who have not authored any posts \u2014 and incorporates comments to further improve performance.\nMethod\nIn this section, we first describe CNN-based document composition, which captures user- and topic-dependent document-level semantic representation from word representations. Then we show how to add comment information to construct the user-topic-comment neural network (UTCNN).\nUser- and Topic-dependent Document Composition\nAs shown in Figure FIGREF4 , we use a general CNN BIBREF3 and two semantic transformations for document composition . We are given a document with an engaged user INLINEFORM0 , a topic INLINEFORM1 , and its composite INLINEFORM2 words, each word INLINEFORM3 of which is associated with a word embedding INLINEFORM4 where INLINEFORM5 is the vector dimension. For each word embedding INLINEFORM6 , we apply two dot operations as shown in Equation EQREF6 : DISPLAYFORM0\nwhere INLINEFORM0 models the user reading preference for certain semantics, and INLINEFORM1 models the topic semantics; INLINEFORM2 and INLINEFORM3 are the dimensions of transformed user and topic embeddings respectively. We use INLINEFORM4 to model semantically what each user prefers to read and/or write, and use INLINEFORM5 to model the semantics of each topic. The dot operation of INLINEFORM6 and INLINEFORM7 transforms the global representation INLINEFORM8 to a user-dependent representation. Likewise, the dot operation of INLINEFORM9 and INLINEFORM10 transforms INLINEFORM11 to a topic-dependent representation.\nAfter the two dot operations on INLINEFORM0 , we have user-dependent and topic-dependent word vectors INLINEFORM1 and INLINEFORM2 , which are concatenated to form a user- and topic-dependent word vector INLINEFORM3 . Then the transformed word embeddings INLINEFORM4 are used as the CNN input. Here we apply three convolutional layers on the concatenated transformed word embeddings INLINEFORM5 : DISPLAYFORM0\nwhere INLINEFORM0 is the index of words; INLINEFORM1 is a non-linear activation function (we use INLINEFORM2 ); INLINEFORM5 is the convolutional filter with input length INLINEFORM6 and output length INLINEFORM7 , where INLINEFORM8 is the window size of the convolutional operation; and INLINEFORM9 and INLINEFORM10 are the output and bias of the convolution layer INLINEFORM11 , respectively. In our experiments, the three window sizes INLINEFORM12 in the three convolution layers are one, two, and three, encoding unigram, bigram, and trigram semantics accordingly.\nAfter the convolutional layer, we add a maximum pooling layer among convolutional outputs to obtain the unigram, bigram, and trigram n-gram representations. This is succeeded by an average pooling layer for an element-wise average of the three maximized convolution outputs.\nUTCNN Model Description\nFigure FIGREF10 illustrates the UTCNN model. As more than one user may interact with a given post, we first add a maximum pooling layer after the user matrix embedding layer and user vector embedding layer to form a moderator matrix embedding INLINEFORM0 and a moderator vector embedding INLINEFORM1 for moderator INLINEFORM2 respectively, where INLINEFORM3 is used for the semantic transformation in the document composition process, as mentioned in the previous section. The term moderator here is to denote the pseudo user who provides the overall semantic/sentiment of all the engaged users for one document. The embedding INLINEFORM4 models the moderator stance preference, that is, the pattern of the revealed user stance: whether a user is willing to show his preference, whether a user likes to show impartiality with neutral statements and reasonable arguments, or just wants to show strong support for one stance. Ideally, the latent user stance is modeled by INLINEFORM5 for each user. Likewise, for topic information, a maximum pooling layer is added after the topic matrix embedding layer and topic vector embedding layer to form a joint topic matrix embedding INLINEFORM6 and a joint topic vector embedding INLINEFORM7 for topic INLINEFORM8 respectively, where INLINEFORM9 models the semantic transformation of topic INLINEFORM10 as in users and INLINEFORM11 models the topic stance tendency. The latent topic stance is also modeled by INLINEFORM12 for each topic.\nAs for comments, we view them as short documents with authors only but without likers nor their own comments. Therefore we apply document composition on comments although here users are commenters (users who comment). It is noticed that the word embeddings INLINEFORM0 for the same word in the posts and comments are the same, but after being transformed to INLINEFORM1 in the document composition process shown in Figure FIGREF4 , they might become different because of their different engaged users. The output comment representation together with the commenter vector embedding INLINEFORM2 and topic vector embedding INLINEFORM3 are concatenated and a maximum pooling layer is added to select the most important feature for comments. Instead of requiring that the comment stance agree with the post, UTCNN simply extracts the most important features of the comment contents; they could be helpful, whether they show obvious agreement or disagreement. Therefore when combining comment information here, the maximum pooling layer is more appropriate than other pooling or merging layers. Indeed, we believe this is one reason for UTCNN's performance gains.\nFinally, the pooled comment representation, together with user vector embedding INLINEFORM0 , topic vector embedding INLINEFORM1 , and document representation are fed to a fully connected network, and softmax is applied to yield the final stance label prediction for the post.\nExperiment\nWe start with the experimental dataset and then describe the training process as well as the implementation of the baselines. We also implement several variations to reveal the effects of features: authors, likers, comment, and commenters. In the results section we compare our model with related work.\nDataset\nWe tested the proposed UTCNN on two different datasets: FBFans and CreateDebate. FBFans is a privately-owned, single-topic, Chinese, unbalanced, social media dataset, and CreateDebate is a public, multiple-topic, English, balanced, forum dataset. Results using these two datasets show the applicability and superiority for different topics, languages, data distributions, and platforms.\nThe FBFans dataset contains data from anti-nuclear-power Chinese Facebook fan groups from September 2013 to August 2014, including posts and their author and liker IDs. There are a total of 2,496 authors, 505,137 likers, 33,686 commenters, and 505,412 unique users. Two annotators were asked to take into account only the post content to label the stance of the posts in the whole dataset as supportive, neutral, or unsupportive (hereafter denoted as Sup, Neu, and Uns). Sup/Uns posts were those in support of or against anti-reconstruction; Neu posts were those evincing a neutral standpoint on the topic, or were irrelevant. Raw agreement between annotators is 0.91, indicating high agreement. Specifically, Cohen\u2019s Kappa for Neu and not Neu labeling is 0.58 (moderate), and for Sup or Uns labeling is 0.84 (almost perfect). Posts with inconsistent labels were filtered out, and the development and testing sets were randomly selected from what was left. Posts in the development and testing sets involved at least one user who appeared in the training set. The number of posts for each stance is shown on the left-hand side of Table TABREF12 . About twenty percent of the posts were labeled with a stance, and the number of supportive (Sup) posts was much larger than that of the unsupportive (Uns) ones: this is thus highly skewed data, which complicates stance classification. On average, 161.1 users were involved in one post. The maximum was 23,297 and the minimum was one (the author). For comments, on average there were 3 comments per post. The maximum was 1,092 and the minimum was zero.\nTo test whether the assumption of this paper \u2013 posts attract users who hold the same stance to like them \u2013 is reliable, we examine the likes from authors of different stances. Posts in FBFans dataset are used for this analysis. We calculate the like statistics of each distinct author from these 32,595 posts. As the numbers of authors in the Sup, Neu and Uns stances are largely imbalanced, these numbers are normalized by the number of users of each stance. Table TABREF13 shows the results. Posts with stances (i.e., not neutral) attract users of the same stance. Neutral posts also attract both supportive and neutral users, like what we observe in supportive posts, but just the neutral posts can attract even more neutral likers. These results do suggest that users prefer posts of the same stance, or at least posts of no obvious stance which might cause annoyance when reading, and hence support the user modeling in our approach.\nThe CreateDebate dataset was collected from an English online debate forum discussing four topics: abortion (ABO), gay rights (GAY), Obama (OBA), and marijuana (MAR). The posts are annotated as for (F) and against (A). Replies to posts in this dataset are also labeled with stance and hence use the same data format as posts. The labeling results are shown in the right-hand side of Table TABREF12 . We observe that the dataset is more balanced than the FBFans dataset. In addition, there are 977 unique users in the dataset. To compare with Hasan and Ng's work, we conducted five-fold cross-validation and present the annotation results as the average number of all folds BIBREF9 , BIBREF5 .\nThe FBFans dataset has more integrated functions than the CreateDebate dataset; thus our model can utilize all linguistic and extra-linguistic features. For the CreateDebate dataset, on the other hand, the like and comment features are not available (as there is a stance label for each reply, replies are evaluated as posts as other previous work) but we still implemented our model using the content, author, and topic information.\nSettings\nIn the UTCNN training process, cross-entropy was used as the loss function and AdaGrad as the optimizer. For FBFans dataset, we learned the 50-dimensional word embeddings on the whole dataset using GloVe BIBREF21 to capture the word semantics; for CreateDebate dataset we used the publicly available English 50-dimensional word embeddings, pre-trained also using GloVe. These word embeddings were fixed in the training process. The learning rate was set to 0.03. All user and topic embeddings were randomly initialized in the range of [-0.1 0.1]. Matrix embeddings for users and topics were sized at 250 ( INLINEFORM0 ); vector embeddings for users and topics were set to length 10.\nWe applied the LDA topic model BIBREF22 on the FBFans dataset to determine the latent topics with which to build topic embeddings, as there is only one general known topic: nuclear power plants. We learned 100 latent topics and assigned the top three topics for each post. For the CreateDebate dataset, which itself constitutes four topics, the topic labels for posts were used directly without additionally applying LDA.\nFor the FBFans data we report class-based f-scores as well as the macro-average f-score ( INLINEFORM0 ) shown in equation EQREF19 . DISPLAYFORM0\nwhere INLINEFORM0 and INLINEFORM1 are the average precision and recall of the three class. We adopted the macro-average f-score as the evaluation metric for the overall performance because (1) the experimental dataset is severely imbalanced, which is common for contentious issues; and (2) for stance classification, content in minor-class posts is usually more important for further applications. For the CreateDebate dataset, accuracy was adopted as the evaluation metric to compare the results with related work BIBREF7 , BIBREF9 , BIBREF12 .\nBaselines\nWe pit our model against the following baselines: 1) SVM with unigram, bigram, and trigram features, which is a standard yet rather strong classifier for text features; 2) SVM with average word embedding, where a document is represented as a continuous representation by averaging the embeddings of the composite words; 3) SVM with average transformed word embeddings (the INLINEFORM0 in equation EQREF6 ), where a document is represented as a continuous representation by averaging the transformed embeddings of the composite words; 4) two mature deep learning models on text classification, CNN BIBREF3 and Recurrent Convolutional Neural Networks (RCNN) BIBREF0 , where the hyperparameters are based on their work; 5) the above SVM and deep learning models with comment information; 6) UTCNN without user information, representing a pure-text CNN model where we use the same user matrix and user embeddings INLINEFORM1 and INLINEFORM2 for each user; 7) UTCNN without the LDA model, representing how UTCNN works with a single-topic dataset; 8) UTCNN without comments, in which the model predicts the stance label given only user and topic information. All these models were trained on the training set, and parameters as well as the SVM kernel selections (linear or RBF) were fine-tuned on the development set. Also, we adopt oversampling on SVMs, CNN and RCNN because the FBFans dataset is highly imbalanced.\nResults on FBFans Dataset\nIn Table TABREF22 we show the results of UTCNN and the baselines on the FBFans dataset. Here Majority yields good performance on Neu since FBFans is highly biased to the neutral class. The SVM models perform well on Sup and Neu but perform poorly for Uns, showing that content information in itself is insufficient to predict stance labels, especially for the minor class. With the transformed word embedding feature, SVM can achieve comparable performance as SVM with n-gram feature. However, the much fewer feature dimension of the transformed word embedding makes SVM with word embeddings a more efficient choice for modeling the large scale social media dataset. For the CNN and RCNN models, they perform slightly better than most of the SVM models but still, the content information is insufficient to achieve a good performance on the Uns posts. As to adding comment information to these models, since the commenters do not always hold the same stance as the author, simply adding comments and post contents together merely adds noise to the model.\nAmong all UTCNN variations, we find that user information is most important, followed by topic and comment information. UTCNN without user information shows results similar to SVMs \u2014 it does well for Sup and Neu but detects no Uns. Its best f-scores on both Sup and Neu among all methods show that with enough training data, content-based models can perform well; at the same time, the lack of user information results in too few clues for minor-class posts to either predict their stance directly or link them to other users and posts for improved performance. The 17.5% improvement when adding user information suggests that user information is especially useful when the dataset is highly imbalanced. All models that consider user information predict the minority class successfully. UCTNN without topic information works well but achieves lower performance than the full UTCNN model. The 4.9% performance gain brought by LDA shows that although it is satisfactory for single topic datasets, adding that latent topics still benefits performance: even when we are discussing the same topic, we use different arguments and supporting evidence. Lastly, we get 4.8% improvement when adding comment information and it achieves comparable performance to UTCNN without topic information, which shows that comments also benefit performance. For platforms where user IDs are pixelated or otherwise hidden, adding comments to a text model still improves performance. In its integration of user, content, and comment information, the full UTCNN produces the highest f-scores on all Sup, Neu, and Uns stances among models that predict the Uns class, and the highest macro-average f-score overall. This shows its ability to balance a biased dataset and supports our claim that UTCNN successfully bridges content and user, topic, and comment information for stance classification on social media text. Another merit of UTCNN is that it does not require a balanced training data. This is supported by its outperforming other models though no oversampling technique is applied to the UTCNN related experiments as shown in this paper. Thus we can conclude that the user information provides strong clues and it is still rich even in the minority class.\nWe also investigate the semantic difference when a user acts as an author/liker or a commenter. We evaluated a variation in which all embeddings from the same user were forced to be identical (this is the UTCNN shared user embedding setting in Table TABREF22 ). This setting yielded only a 2.5% improvement over the model without comments, which is not statistically significant. However, when separating authors/likers and commenters embeddings (i.e., the UTCNN full model), we achieved much greater improvements (4.8%). We attribute this result to the tendency of users to use different wording for different roles (for instance author vs commenter). This is observed when the user, acting as an author, attempts to support her argument against nuclear power by using improvements in solar power; when acting as a commenter, though, she interacts with post contents by criticizing past politicians who supported nuclear power or by arguing that the proposed evacuation plan in case of a nuclear accident is ridiculous. Based on this finding, in the final UTCNN setting we train two user matrix embeddings for one user: one for the author/liker role and the other for the commenter role.\nResults on CreateDebate Dataset\nTable TABREF24 shows the results of UTCNN, baselines as we implemented on the FBFans datset and related work on the CreateDebate dataset. We do not adopt oversampling on these models because the CreateDebate dataset is almost balanced. In previous work, integer linear programming (ILP) or linear-chain conditional random fields (CRFs) were proposed to integrate text features, author, ideology, and user-interaction constraints, where text features are unigram, bigram, and POS-dependencies; the author constraint tends to require that posts from the same author for the same topic hold the same stance; the ideology constraint aims to capture inferences between topics for the same author; the user-interaction constraint models relationships among posts via user interactions such as replies BIBREF7 , BIBREF9 .\nThe SVM with n-gram or average word embedding feature performs just similar to the majority. However, with the transformed word embedding, it achieves superior results. It shows that the learned user and topic embeddings really capture the user and topic semantics. This finding is not so obvious in the FBFans dataset and it might be due to the unfavorable data skewness for SVM. As for CNN and RCNN, they perform slightly better than most SVMs as we found in Table TABREF22 for FBFans.\nCompared to the ILP BIBREF7 and CRF BIBREF9 methods, the UTCNN user embeddings encode author and user-interaction constraints, where the ideology constraint is modeled by the topic embeddings and text features are modeled by the CNN. The significant improvement achieved by UTCNN suggests the latent representations are more effective than overt model constraints.\nThe PSL model BIBREF12 jointly labels both author and post stance using probabilistic soft logic (PSL) BIBREF23 by considering text features and reply links between authors and posts as in Hasan and Ng's work. Table TABREF24 reports the result of their best AD setting, which represents the full joint stance/disagreement collective model on posts and is hence more relevant to UTCNN. In contrast to their model, the UTCNN user embeddings represent relationships between authors, but UTCNN models do not utilize link information between posts. Though the PSL model has the advantage of being able to jointly label the stances of authors and posts, its performance on posts is lower than the that for the ILP or CRF models. UTCNN significantly outperforms these models on posts and has the potential to predict user stances through the generated user embeddings.\nFor the CreateDebate dataset, we also evaluated performance when not using topic embeddings or user embeddings; as replies in this dataset are viewed as posts, the setting without comment embeddings is not available. Table TABREF24 shows the same findings as Table TABREF22 : the 21% improvement in accuracy demonstrates that user information is the most vital. This finding also supports the results in the related work: user constraints are useful and can yield 11.2% improvement in accuracy BIBREF7 . Further considering topic information yields 3.4% improvement, suggesting that knowing the subject of debates provides useful information. In sum, Table TABREF22 together with Table TABREF24 show that UTCNN achieves promising performance regardless of topic, language, data distribution, and platform.\nConclusion\nWe have proposed UTCNN, a neural network model that incorporates user, topic, content and comment information for stance classification on social media texts. UTCNN learns user embeddings for all users with minimum active degree, i.e., one post or one like. Topic information obtained from the topic model or the pre-defined labels further improves the UTCNN model. In addition, comment information provides additional clues for stance classification. We have shown that UTCNN achieves promising and balanced results. In the future we plan to explore the effectiveness of the UTCNN user embeddings for author stance classification.\nAcknowledgements\nResearch of this paper was partially supported by Ministry of Science and Technology, Taiwan, under the contract MOST 104-2221-E-001-024-MY2.", "source": "scientific_qa", "evaluation": "LLM"}
{"instructions": ["How did they obtain the dataset?", "What activation function do they use in their model?", "What baselines do they compare to?", "How are chunks defined?", "What features are extracted?", "Was the approach used in this work to detect fake news fully supervised?", "Based on this paper, what is the more predictive set of features to detect fake news?", "How big is the dataset used in this work?", "How is a \"chunk of posts\" defined in this work?", "What baselines were used in this work?"], "outputs": ["public resources where suspicious Twitter accounts were annotated, list with another 32 Twitter accounts from BIBREF19 that are considered trustworthy", "relu, selu, tanh", "Top-$k$ replies, likes, or re-tweets, FacTweet (tweet-level), LR + All Features (chunk-level), LR + All Features (tweet-level), Tweet2vec, LR + Bag-of-words", "Chunks is group of tweets from single account that  is consecutive in time - idea is that this group can show secret intention of malicious accounts.", "Sentiment, Morality, Style, Words embeddings", "Yes", "words embeddings, style, and morality features", "Total dataset size: 171 account (522967 tweets)", "chunk consists of a sorted sequence of tweets labeled by the label of its corresponding account", "LR + Bag-of-words, Tweet2vec, LR + All Features (tweet-level), LR + All Features (chunk-level), FacTweet (tweet-level), Top-$k$ replies, likes, or re-tweets"], "input": "Introduction\nSocial media platforms have made the spreading of fake news easier, faster as well as able to reach a wider audience. Social media offer another feature which is the anonymity for the authors, and this opens the door to many suspicious individuals or organizations to utilize these platforms. Recently, there has been an increased number of spreading fake news and rumors over the web and social media BIBREF0. Fake news in social media vary considering the intention to mislead. Some of these news are spread with the intention to be ironic or to deliver the news in an ironic way (satirical news). Others, such as propaganda, hoaxes, and clickbaits, are spread to mislead the audience or to manipulate their opinions. In the case of Twitter, suspicious news annotations should be done on a tweet rather than an account level, since some accounts mix fake with real news. However, these annotations are extremely costly and time consuming \u2013 i.e., due to high volume of available tweets Consequently, a first step in this direction, e.g., as a pre-filtering step, can be viewed as the task of detecting fake news at the account level.\nThe main obstacle for detecting suspicious Twitter accounts is due to the behavior of mixing some real news with the misleading ones. Consequently, we investigate ways to detect suspicious accounts by considering their tweets in groups (chunks). Our hypothesis is that suspicious accounts have a unique pattern in posting tweet sequences. Since their intention is to mislead, the way they transition from one set of tweets to the next has a hidden signature, biased by their intentions. Therefore, reading these tweets in chunks has the potential to improve the detection of the fake news accounts.\nIn this work, we investigate the problem of discriminating between factual and non-factual accounts in Twitter. To this end, we collect a large dataset of tweets using a list of propaganda, hoax and clickbait accounts and compare different versions of sequential chunk-based approaches using a variety of feature sets against several baselines. Several approaches have been proposed for news verification, whether in social media (rumors detection) BIBREF0, BIBREF1, BIBREF2, BIBREF3, BIBREF4, or in news claims BIBREF5, BIBREF6, BIBREF7, BIBREF8. The main orientation in the previous works is to verify the textual claims/tweets but not their sources. To the best of our knowledge, this is the first work aiming to detect factuality at the account level, and especially from a textual perspective. Our contributions are:\n[leftmargin=4mm]\nWe propose an approach to detect non-factual Twitter accounts by treating post streams as a sequence of tweets' chunks. We test several semantic and dictionary-based features together with a neural sequential approach, and apply an ablation test to investigate their contribution.\nWe benchmark our approach against other approaches that discard the chronological order of the tweets or read the tweets individually. The results show that our approach produces superior results at detecting non-factual accounts.\nMethodology\nGiven a news Twitter account, we read its tweets from the account's timeline. Then we sort the tweets by the posting date in ascending way and we split them into $N$ chunks. Each chunk consists of a sorted sequence of tweets labeled by the label of its corresponding account. We extract a set of features from each chunk and we feed them into a recurrent neural network to model the sequential flow of the chunks' tweets. We use an attention layer with dropout to attend over the most important tweets in each chunk. Finally, the representation is fed into a softmax layer to produce a probability distribution over the account types and thus predict the factuality of the accounts. Since we have many chunks for each account, the label for an account is obtained by taking the majority class of the account's chunks.\nInput Representation. Let $t$ be a Twitter account that contains $m$ tweets. These tweets are sorted by date and split into a sequence of chunks $ck = \\langle ck_1, \\ldots , ck_n \\rangle $, where each $ck_i$ contains $s$ tweets. Each tweet in $ck_i$ is represented by a vector $v \\in {\\rm I\\!R}^d$ , where $v$ is the concatenation of a set of features' vectors, that is $v = \\langle f_1, \\ldots , f_n \\rangle $. Each feature vector $f_i$ is built by counting the presence of tweet's words in a set of lexical lists. The final representation of the tweet is built by averaging the single word vectors.\nFeatures. We argue that different kinds of features like the sentiment of the text, morality, and other text-based features are critical to detect the nonfactual Twitter accounts by utilizing their occurrence during reporting the news in an account's timeline. We employ a rich set of features borrowed from previous works in fake news, bias, and rumors detection BIBREF0, BIBREF1, BIBREF8, BIBREF9.\n[leftmargin=4mm]\nEmotion: We build an emotions vector using word occurrences of 15 emotion types from two available emotional lexicons. We use the NRC lexicon BIBREF10, which contains $\\sim $14K words labeled using the eight Plutchik's emotions BIBREF11. The other lexicon is SentiSense BIBREF12 which is a concept-based affective lexicon that attaches emotional meanings to concepts from the WordNet lexical database. It has $\\sim $5.5 words labeled with emotions from a set of 14 emotional categories We use the categories that do not exist in the NRC lexicon.\nSentiment: We extract the sentiment of the tweets by employing EffectWordNet BIBREF13, SenticNet BIBREF14, NRC BIBREF10, and subj_lexicon BIBREF15, where each has the two sentiment classes, positive and negative.\nMorality: Features based on morality foundation theory BIBREF16 where words are labeled in one of the following 10 categories (care, harm, fairness, cheating, loyalty, betrayal, authority, subversion, sanctity, and degradation).\nStyle: We use canonical stylistic features, such as the count of question marks, exclamation marks, consecutive characters and letters, links, hashtags, users' mentions. In addition, we extract the uppercase ratio and the tweet length.\nWords embeddings: We extract words embeddings of the words of the tweet using $Glove\\-840B-300d$ BIBREF17 pretrained model. The tweet final representation is obtained by averaging its words embeddings.\nModel. To account for chunk sequences we make use of a de facto standard approach and opt for a recurrent neural model using long short-term memory (LSTM) BIBREF18. In our model, the sequence consists of a sequence of tweets belonging to one chunk (Figure FIGREF11). The LSTM learns the hidden state $h\\textsubscript {t}$ by capturing the previous timesteps (past tweets). The produced hidden state $h\\textsubscript {t}$ at each time step is passed to the attention layer which computes a `context' vector $c\\textsubscript {t}$ as the weighted mean of the state sequence $h$ by:\nWhere $T$ is the total number of timesteps in the input sequence and $\\alpha \\textsubscript {tj}$ is a weight computed at each time step $j$ for each state hj.\nExperiments and Results\nData. We build a dataset of Twitter accounts based on two lists annotated in previous works. For the non-factual accounts, we rely on a list of 180 Twitter accounts from BIBREF1. This list was created based on public resources where suspicious Twitter accounts were annotated with the main fake news types (clickbait, propaganda, satire, and hoax). We discard the satire labeled accounts since their intention is not to mislead or deceive. On the other hand, for the factual accounts, we use a list with another 32 Twitter accounts from BIBREF19 that are considered trustworthy by independent third parties. We discard some accounts that publish news in languages other than English (e.g., Russian or Arabic). Moreover, to ensure the quality of the data, we remove the duplicate, media-based, and link-only tweets. For each account, we collect the maximum amount of tweets allowed by Twitter API. Table TABREF13 presents statistics on our dataset.\nBaselines. We compare our approach (FacTweet) to the following set of baselines:\n[leftmargin=4mm]\nLR + Bag-of-words: We aggregate the tweets of a feed and we use a bag-of-words representation with a logistic regression (LR) classifier.\nTweet2vec: We use the Bidirectional Gated recurrent neural network model proposed in BIBREF20. We keep the default parameters that were provided with the implementation. To represent the tweets, we use the decoded embedding produced by the model. With this baseline we aim at assessing if the tweets' hashtags may help detecting the non-factual accounts.\nLR + All Features (tweet-level): We extract all our features from each tweet and feed them into a LR classifier. Here, we do not aggregate over tweets and thus view each tweet independently.\nLR + All Features (chunk-level): We concatenate the features' vectors of the tweets in a chunk and feed them into a LR classifier.\nFacTweet (tweet-level): Similar to the FacTweet approach, but at tweet-level; the sequential flow of the tweets is not utilized. We aim at investigating the importance of the sequential flow of tweets.\nTop-$k$ replies, likes, or re-tweets: Some approaches in rumors detection use the number of replies, likes, and re-tweets to detect rumors BIBREF21. Thus, we extract top $k$ replied, liked or re-tweeted tweets from each account to assess the accounts factuality. We tested different $k$ values between 10 tweets to the max number of tweets from each account. Figure FIGREF24 shows the macro-F1 values for different $k$ values. It seems that $k=500$ for the top replied tweets achieves the highest result. Therefore, we consider this as a baseline.\nExperimental Setup. We apply a 5 cross-validation on the account's level. For the FacTweet model, we experiment with 25% of the accounts for validation and parameters selection. We use hyperopt library to select the hyper-parameters on the following values: LSTM layer size (16, 32, 64), dropout ($0.0-0.9$), activation function ($relu$, $selu$, $tanh$), optimizer ($sgd$, $adam$, $rmsprop$) with varying the value of the learning rate (1e-1,..,-5), and batch size (4, 8, 16). The validation split is extracted on the class level using stratified sampling: we took a random 25% of the accounts from each class since the dataset is unbalanced. Discarding the classes' size in the splitting process may affect the minority classes (e.g. hoax). For the baselines' classifier, we tested many classifiers and the LR showed the best overall performance.\nResults. Table TABREF25 presents the results. We present the results using a chunk size of 20, which was found to be the best size on the held-out data. Figure FIGREF24 shows the results of different chunks sizes. FacTweet performs better than the proposed baselines and obtains the highest macro-F1 value of $0.565$. Our results indicate the importance of taking into account the sequence of the tweets in the accounts' timelines. The sequence of these tweets is better captured by our proposed model sequence-agnostic or non-neural classifiers. Moreover, the results demonstrate that the features at tweet-level do not perform well to detect the Twitter accounts factuality, since they obtain a result near to the majority class ($0.18$). Another finding from our experiments shows that the performance of the Tweet2vec is weak. This demonstrates that tweets' hashtags are not informative to detect non-factual accounts. In Table TABREF25, we present ablation tests so as to quantify the contribution of subset of features. The results indicate that most performance gains come from words embeddings, style, and morality features. Other features (emotion and sentiment) show lower importance: nevertheless, they still improve the overall system performance (on average 0.35% Macro-F$_1$ improvement). These performance figures suggest that non-factual accounts use semantic and stylistic hidden signatures mostly while tweeting news, so as to be able to mislead the readers and behave as reputable (i.e., factual) sources. We leave a more fine-grained, diachronic analysis of semantic and stylistic features \u2013 how semantic and stylistic signature evolve across time and change across the accounts' timelines \u2013 for future work.\nConclusions\nIn this paper, we proposed a model that utilizes chunked timelines of tweets and a recurrent neural model in order to infer the factuality of a Twitter news account. Our experimental results indicate the importance of analyzing tweet stream into chunks, as well as the benefits of heterogeneous knowledge source (i.e., lexica as well as text) in order to capture factuality. In future work, we would like to extend this line of research with further in-depth analysis to understand the flow change of the used features in the accounts' streams. Moreover, we would like to take our approach one step further incorporating explicit temporal information, e.g., using timestamps. Crucially, we are also interested in developing a multilingual version of our approach, for instance by leveraging the now ubiquitous cross-lingual embeddings BIBREF22, BIBREF23.", "source": "scientific_qa", "evaluation": "LLM"}
{"instructions": ["What is the approach of previous work?", "Is the lexicon the same for all languages?", "How do they obtain the lexicon?", "What evaluation metric is used?", "Which languages are similar to each other?", "Which datasets are employed for South African languages LID?", "Does the paper report the performance of a baseline model on South African languages LID?", "Does the algorithm improve on the state-of-the-art methods?"], "outputs": ["'shallow' naive Bayes, SVM, hierarchical stacked classifiers, bidirectional recurrent neural networks", "Yes", "built over all the data and therefore includes the vocabulary from both the training and testing sets", "average classification accuracy, execution performance", "Nguni languages (zul, xho, nbl, ssw), Sotho languages (nso, sot, tsn)", "DSL 2015, DSL 2017, JW300 parallel corpus , NCHLT text corpora", "Yes", "Yes"], "input": "Introduction\nAccurate language identification (LID) is the first step in many natural language processing and machine comprehension pipelines. If the language of a piece of text is known then the appropriate downstream models like parts of speech taggers and language models can be applied as required.\nLID is further also an important step in harvesting scarce language resources. Harvested data can be used to bootstrap more accurate LID models and in doing so continually improve the quality of the harvested data. Availability of data is still one of the big roadblocks for applying data driven approaches like supervised machine learning in developing countries.\nHaving 11 official languages of South Africa has lead to initiatives (discussed in the next section) that have had positive effect on the availability of language resources for research. However, many of the South African languages are still under resourced from the point of view of building data driven models for machine comprehension and process automation.\nTable TABREF2 shows the percentages of first language speakers for each of the official languages of South Africa. These are four conjunctively written Nguni languages (zul, xho, nbl, ssw), Afrikaans (afr) and English (eng), three disjunctively written Sotho languages (nso, sot, tsn), as well as tshiVenda (ven) and Xitsonga (tso). The Nguni languages are similar to each other and harder to distinguish. The same is true of the Sotho languages.\nThis paper presents a hierarchical naive Bayesian and lexicon based classifier for LID of short pieces of text of 15-20 characters long. The algorithm is evaluated against recent approaches using existing test sets from previous works on South African languages as well as the Discriminating between Similar Languages (DSL) 2015 and 2017 shared tasks.\nSection SECREF2 reviews existing works on the topic and summarises the remaining research problems. Section SECREF3 of the paper discusses the proposed algorithm and Section SECREF4 presents comparative results.\nRelated Works\nThe focus of this section is on recently published datasets and LID research applicable to the South African context. An in depth survey of algorithms, features, datasets, shared tasks and evaluation methods may be found in BIBREF0.\nThe datasets for the DSL 2015 & DSL 2017 shared tasks BIBREF1 are often used in LID benchmarks and also available on Kaggle . The DSL datasets, like other LID datasets, consists of text sentences labelled by language. The 2017 dataset, for example, contains 14 languages over 6 language groups with 18000 training samples and 1000 testing samples per language.\nThe recently published JW300 parallel corpus BIBREF2 covers over 300 languages with around 100 thousand parallel sentences per language pair on average. In South Africa, a multilingual corpus of academic texts produced by university students with different mother tongues is being developed BIBREF3. The WiLI-2018 benchmark dataset BIBREF4 for monolingual written natural language identification includes around 1000 paragraphs of 235 languages. A possibly useful link can also be made BIBREF5 between Native Language Identification (NLI) (determining the native language of the author of a text) and Language Variety Identification (LVI) (classification of different varieties of a single language) which opens up more datasets. The Leipzig Corpora Collection BIBREF6, the Universal Declaration of Human Rights and Tatoeba are also often used sources of data.\nThe NCHLT text corpora BIBREF7 is likely a good starting point for a shared LID task dataset for the South African languages BIBREF8. The NCHLT text corpora contains enough data to have 3500 training samples and 600 testing samples of 300+ character sentences per language. Researchers have recently started applying existing algorithms for tasks like neural machine translation in earnest to such South African language datasets BIBREF9.\nExisting NLP datasets, models and services BIBREF10 are available for South African languages. These include an LID algorithm BIBREF11 that uses a character level n-gram language model. Multiple papers have shown that 'shallow' naive Bayes classifiers BIBREF12, BIBREF8, BIBREF13, BIBREF14, SVMs BIBREF15 and similar models work very well for doing LID. The DSL 2017 paper BIBREF1, for example, gives an overview of the solutions of all of the teams that competed on the shared task and the winning approach BIBREF16 used an SVM with character n-gram, parts of speech tag features and some other engineered features. The winning approach for DSL 2015 used an ensemble naive Bayes classifier. The fasttext classifier BIBREF17 is perhaps one of the best known efficient 'shallow' text classifiers that have been used for LID .\nMultiple papers have proposed hierarchical stacked classifiers (including lexicons) that would for example first classify a piece of text by language group and then by exact language BIBREF18, BIBREF19, BIBREF8, BIBREF0. Some work has also been done on classifying surnames between Tshivenda, Xitsonga and Sepedi BIBREF20. Additionally, data augmentation BIBREF21 and adversarial training BIBREF22 approaches are potentially very useful to reduce the requirement for data.\nResearchers have investigated deeper LID models like bidirectional recurrent neural networks BIBREF23 or ensembles of recurrent neural networks BIBREF24. The latter is reported to achieve 95.12% in the DSL 2015 shared task. In these models text features can include character and word n-grams as well as informative character and word-level features learnt BIBREF25 from the training data. The neural methods seem to work well in tasks where more training data is available.\nIn summary, LID of short texts, informal styles and similar languages remains a difficult problem which is actively being researched. Increased confusion can in general be expected between shorter pieces of text and languages that are more closely related. Shallow methods still seem to work well compared to deeper models for LID. Other remaining research opportunities seem to be data harvesting, building standardised datasets and creating shared tasks for South Africa and Africa. Support for language codes that include more languages seems to be growing and discoverability of research is improving with more survey papers coming out. Paywalls also seem to no longer be a problem; the references used in this paper was either openly published or available as preprint papers.\nMethodology\nThe proposed LID algorithm builds on the work in BIBREF8 and BIBREF26. We apply a naive Bayesian classifier with character (2, 4 & 6)-grams, word unigram and word bigram features with a hierarchical lexicon based classifier.\nThe naive Bayesian classifier is trained to predict the specific language label of a piece of text, but used to first classify text as belonging to either the Nguni family, the Sotho family, English, Afrikaans, Xitsonga or Tshivenda. The scikit-learn multinomial naive Bayes classifier is used for the implementation with an alpha smoothing value of 0.01 and hashed text features.\nThe lexicon based classifier is then used to predict the specific language within a language group. For the South African languages this is done for the Nguni and Sotho groups. If the lexicon prediction of the specific language has high confidence then its result is used as the final label else the naive Bayesian classifier's specific language prediction is used as the final result. The lexicon is built over all the data and therefore includes the vocabulary from both the training and testing sets.\nThe lexicon based classifier is designed to trade higher precision for lower recall. The proposed implementation is considered confident if the number of words from the winning language is at least one more than the number of words considered to be from the language scored in second place.\nThe stacked classifier is tested against three public LID implementations BIBREF17, BIBREF23, BIBREF8. The LID implementation described in BIBREF17 is available on GitHub and is trained and tested according to a post on the fasttext blog. Character (5-6)-gram features with 16 dimensional vectors worked the best. The implementation discussed in BIBREF23 is available from https://github.com/tomkocmi/LanideNN. Following the instructions for an OSX pip install of an old r0.8 release of TensorFlow, the LanideNN code could be executed in Python 3.7.4. Settings were left at their defaults and a learning rate of 0.001 was used followed by a refinement with learning rate of 0.0001. Only one code modification was applied to return the results from a method that previously just printed to screen. The LID algorithm described in BIBREF8 is also available on GitHub.\nThe stacked classifier is also tested against the results reported for four other algorithms BIBREF16, BIBREF26, BIBREF24, BIBREF15. All the comparisons are done using the NCHLT BIBREF7, DSL 2015 BIBREF19 and DSL 2017 BIBREF1 datasets discussed in Section SECREF2.\nResults and Analysis\nThe average classification accuracy results are summarised in Table TABREF9. The accuracies reported are for classifying a piece of text by its specific language label. Classifying text only by language group or family is a much easier task as reported in BIBREF8.\nDifferent variations of the proposed classifier were evaluated. A single NB classifier (NB), a stack of two NB classifiers (NB+NB), a stack of a NB classifier and lexicon (NB+Lex) and a lexicon (Lex) by itself. A lexicon with a 50% training token dropout is also listed to show the impact of the lexicon support on the accuracy.\nFrom the results it seems that the DSL 2017 task might be harder than the DSL 2015 and NCHLT tasks. Also, the results for the implementation discussed in BIBREF23 might seem low, but the results reported in that paper is generated on longer pieces of text so lower scores on the shorter pieces of text derived from the NCHLT corpora is expected.\nThe accuracy of the proposed algorithm seems to be dependent on the support of the lexicon. Without a good lexicon a non-stacked naive Bayesian classifier might even perform better.\nThe execution performance of some of the LID implementations are shown in Table TABREF10. Results were generated on an early 2015 13-inch Retina MacBook Pro with a 2.9 GHz CPU (Turbo Boosted to 3.4 GHz) and 8GB RAM. The C++ implementation in BIBREF17 is the fastest. The implementation in BIBREF8 makes use of un-hashed feature representations which causes it to be slower than the proposed sklearn implementation. The execution performance of BIBREF23 might improve by a factor of five to ten when executed on a GPU.\nConclusion\nLID of short texts, informal styles and similar languages remains a difficult problem which is actively being researched. The proposed algorithm was evaluated on three existing datasets and compared to the implementations of three public LID implementations as well as to reported results of four other algorithms. It performed well relative to the other methods beating their results. However, the performance is dependent on the support of the lexicon.\nWe would like to investigate the value of a lexicon in a production system and how to possibly maintain it using self-supervised learning. We are investigating the application of deeper language models some of which have been used in more recent DSL shared tasks. We would also like to investigate data augmentation strategies to reduce the amount of training data that is required.\nFurther research opportunities include data harvesting, building standardised datasets and shared tasks for South Africa as well as the rest of Africa. In general, the support for language codes that include more languages seems to be growing, discoverability of research is improving and paywalls seem to no longer be a big problem in getting access to published research.", "source": "scientific_qa", "evaluation": "human"}
